Preface


This  is  an  exciting  time  to  be  working  in  speech  and  language  processing. 
Historically  distinct  fields  (natural  language  processing,  speech  recognition, 
computational  linguistics,  computational  psycholinguistics)  have  begun  to 
merge. The commercial availability of speech recognition and the need
for web-based language techniques have provided an important impetus for
development  of  real  systems.  The  availability  of  very  large  on-line  corpora 
has  enabled  statistical  models  of  language  at  every  level,  from  phonetics  to 
discourse.  We  have  tried  to  draw  on  this  emerging  state  of  the  art  in  the 
design  of  this  pedagogical  and  reference  work: 

1.  Coverage 

In attempting to describe a unified vision of speech and language processing,
we cover areas that traditionally are taught in different courses
in different departments: speech recognition in electrical engineering;
parsing, semantic interpretation, and pragmatics in natural language
processing courses in computer science departments; and computational
morphology and phonology in computational linguistics courses in linguistics
departments. The book introduces the fundamental algorithms
of each of these fields, whether originally proposed for spoken or written
language, whether logical or statistical in origin, and attempts to
tie together the descriptions of algorithms from different domains. We
have also included coverage of applications like spelling checking and
information retrieval and extraction, as well as areas like cognitive
modeling. A potential problem with this broad-coverage approach is
that it required us to include introductory material for each field; thus
linguists may want to skip our description of articulatory phonetics,
computer scientists may want to skip such sections as regular expressions,
and electrical engineers the sections on signal processing. Of
course, even in a book this long, we didn’t have room for everything.
Thus this book should not be considered a substitute for important relevant
courses in linguistics, automata and formal language theory, or,
especially, statistics and information theory.

2.  Emphasis  on  practical  applications 

It is important to show how language-related algorithms and techniques
(from HMMs to unification, from the lambda calculus to
transformation-based learning) can be applied to important real-world
problems: spelling checking, text document search, speech recognition,
Web-page processing, part-of-speech tagging, machine translation,
and spoken-language dialog agents. We have attempted to do this
by integrating the description of language processing applications into
each chapter. The advantage of this approach is that as the relevant
linguistic knowledge is introduced, the student has the background to
understand and model a particular domain.

3.  Emphasis  on  scientific  evaluation 

The recent prevalence of statistical algorithms in language processing
and the growth of organized evaluations of speech and language processing
systems have led to a new emphasis on evaluation. We have,
therefore, tried to accompany most of our problem domains with a
Methodology Box describing how systems are evaluated (e.g., including
such concepts as training and test sets, cross-validation, and
information-theoretic evaluation metrics like perplexity).

4. Description of widely available language processing resources

Modern speech and language processing is heavily based on common
resources: raw speech and text corpora, annotated corpora and
treebanks, standard tagsets for labeling pronunciation, part of speech,
parses, word-sense, and dialog-level phenomena. We have tried to introduce
many of these important resources throughout the book (for example
the Brown, Switchboard, CALLHOME, ATIS, TREC, MUC, and
BNC corpora), and provide complete listings of many useful tagsets
and coding schemes (such as the Penn Treebank, CLAWS C5 and C7,
and the ARPAbet), but some inevitably got left out. Furthermore, rather
than include references to URLs for many resources directly in the
textbook, we have placed them on the book’s web site, where they can
be more readily updated.

The book is primarily intended for use in a graduate or advanced undergraduate
course or sequence. Because of its comprehensive coverage and the
large number of algorithms, the book is also useful as a reference for students
and professionals in any of the areas of speech and language processing.

Overview  of  the  book 

The book is divided into four parts in addition to an introduction and end matter.
Part I, “Words”, introduces concepts related to the processing of words (phonetics,
phonology, morphology) and the algorithms used to process them: finite
automata, finite transducers, weighted transducers, N-grams, and Hidden
Markov Models. Part II, “Syntax”, introduces parts-of-speech and phrase
structure grammars for English, and gives essential algorithms for processing
word classes and structured relationships among words: part-of-speech
taggers based on HMMs and transformation-based learning, the CYK and
Earley algorithms for parsing, unification and typed feature structures, lexicalized
and probabilistic parsing, and analytical tools like the Chomsky
hierarchy and the pumping lemma. Part III, “Semantics”, introduces first
order predicate calculus and other ways of representing meaning, several
approaches to compositional semantic analysis, along with applications to
information retrieval, information extraction, speech understanding, and machine
translation. Part IV, “Pragmatics”, covers reference resolution and discourse
structure and coherence, spoken dialog phenomena like dialog and
speech act modeling, dialog structure and coherence, and dialog managers,
as well as a comprehensive treatment of natural language generation and of
machine translation.

Using  this  book 

The  book  provides  enough  material  to  be  used  for  a full  year  sequence  in 
speech  and  language  processing.  It  is  also  designed  so  that  it  can  be  used  for 
a number  of  different  useful  one-term  courses: 


NLP                 NLP                   Speech + NLP          Comp. Linguistics
1 quarter           1 semester            1 semester            1 quarter

1. Intro            1. Intro              1. Intro              1. Intro
2. Regex, FSA       2. Regex, FSA         2. Regex, FSA         2. Regex, FSA
8. POS tagging      3. Morph., FST        3. Morph., FST        3. Morph., FST
9. CFGs             6. N-grams            4. Comp. Phonol.      4. Comp. Phonol.
10. Parsing         8. POS tagging        5. Prob. Pronun.      10. Parsing
11. Unification     9. CFGs               6. N-grams            11. Unification
14. Semantics       10. Parsing           7. HMMs & ASR         13. Complexity
15. Sem. Analysis   11. Unification       8. POS tagging        16. Lex. Semantics
18. Discourse       12. Prob. Parsing     9. CFGs               18. Discourse
20. Generation      14. Semantics         10. Parsing           19. Dialog
                    15. Sem. Analysis     12. Prob. Parsing
                    16. Lex. Semantics    14. Semantics
                    18. Discourse         15. Sem. Analysis
                    19. WSD and IR        19. Dialog
                    20. Generation        21. Machine Transl.
                    21. Machine Transl.

Selected  chapters  from  the  book  could  also  be  used  to  augment  courses 
in  Artificial  Intelligence,  Cognitive  Science,  or  Information  Retrieval. 

Dave Bowman: Open the pod bay doors, HAL.

HAL: I’m sorry Dave, I’m afraid I can’t do that.

Stanley  Kubrick  and  Arthur  C.  Clarke, 
screenplay  of  2001:  A Space  Odyssey 


The  HAL  9000  computer  in  Stanley  Kubrick’s  film  2001:  A Space 
Odyssey  is  one  of  the  most  recognizable  characters  in  twentieth-century 
cinema.  HAL  is  an  artificial  agent  capable  of  such  advanced  language- 
processing behavior  as  speaking  and  understanding  English,  and  at  a crucial 
moment  in  the  plot,  even  reading  lips.  It  is  now  clear  that  HAL’s  creator 
Arthur  C.  Clarke  was  a little  optimistic  in  predicting  when  an  artificial  agent 
such  as  HAL  would  be  available.  But  just  how  far  off  was  he?  What  would 
it  take  to  create  at  least  the  language-related  parts  of  HAL?  Minimally,  such 
an  agent  would  have  to  be  capable  of  interacting  with  humans  via  language, 
which  includes  understanding  humans  via  speech  recognition  and  natural 
language  understanding  (and  of  course  lip-reading),  and  of  communicat- 
ing with  humans  via  natural  language  generation  and  speech  synthesis. 
HAL  would  also  need  to  be  able  to  do  information  retrieval  (finding  out 
where  needed  textual  resources  reside),  information  extraction  (extracting 
pertinent  facts  from  those  textual  resources),  and  inference  (drawing  con- 
clusions based  on  known  facts). 

Although  these  problems  are  far  from  completely  solved,  much  of  the 
language-related  technology  that  HAL  needs  is  currently  being  developed, 
with  some  of  it  already  available  commercially.  Solving  these  problems, 
and  others  like  them,  is  the  main  concern  of  the  fields  known  as  Natural 
Language Processing, Computational Linguistics and Speech Recognition
and Synthesis, which together we call Speech and Language Processing.
The goal of this book is to describe the state of the art of this technology
at the start of the twenty-first century. The applications we will consider
are all of those needed for agents like HAL, as well as other valuable areas
of  language  processing  such  as  spelling  correction,  grammar  checking, 
information  retrieval,  and  machine  translation. 


1.1 Knowledge in Speech and Language Processing

By  speech  and  language  processing,  we  have  in  mind  those  computational 
techniques  that  process  spoken  and  written  human  language,  as  language. 
As  we  will  see,  this  is  an  inclusive  definition  that  encompasses  everything 
from  mundane  applications  such  as  word  counting  and  automatic  hyphen- 
ation, to  cutting  edge  applications  such  as  automated  question  answering  on 
the  Web,  and  real-time  spoken  language  translation. 

What  distinguishes  these  language  processing  applications  from  other 
data  processing  systems  is  their  use  of  knowledge  of  language.  Consider  the 
Unix  wc  program,  which  is  used  to  count  the  total  number  of  bytes,  words, 
and  lines  in  a text  file.  When  used  to  count  bytes  and  lines,  wc  is  an  ordinary 
data  processing  application.  However,  when  it  is  used  to  count  the  words 
in a file it requires knowledge about what it means to be a word, and thus
becomes  a language  processing  system. 
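
The point about word counting can be made concrete with a small sketch. The two functions below are illustrative assumptions, not the actual implementation of wc: the first treats a word as any run of non-whitespace characters, the second (a hypothetical alternative) treats a word as a run of letters with optional internal apostrophes. The counts differ on the same input, which shows that even counting words requires a decision about what a word is.

```python
import re


def count_whitespace_words(text: str) -> int:
    """Count words as maximal runs of non-whitespace characters."""
    return len(text.split())


def count_alphabetic_words(text: str) -> int:
    """Count words as runs of letters, allowing internal apostrophes.

    This definition is a hypothetical alternative chosen for illustration;
    it ignores free-standing punctuation and digits.
    """
    return len(re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text))


sample = "HAL: I'm sorry Dave, I'm afraid I can't do that - really."
print(count_whitespace_words(sample))  # counts the lone "-" as a word
print(count_alphabetic_words(sample))  # does not
```

Neither definition is right in general; contractions, hyphenated compounds, and numbers each force further choices.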

Of  course,  wc  is  an  extremely  simple  system  with  an  extremely  lim- 
ited and  impoverished  knowledge  of  language.  More  sophisticated  language 
agents  such  as  HAL  require  much  broader  and  deeper  knowledge  of  lan- 
guage. To  get  a feeling  for  the  scope  and  kind  of  knowledge  required  in 
more  sophisticated  applications,  consider  some  of  what  HAL  would  need  to 
know  to  engage  in  the  dialogue  that  begins  this  chapter. 

To  determine  what  Dave  is  saying,  HAL  must  be  capable  of  analyzing 
an  incoming  audio  signal  and  recovering  the  exact  sequence  of  words  Dave 
used  to  produce  that  signal.  Similarly,  in  generating  its  response,  HAL  must 
be  able  to  take  a sequence  of  words  and  generate  an  audio  signal  that  Dave 
can  recognize.  Both  of  these  tasks  require  knowledge  about  phonetics  and 
phonology, which can help model how words are pronounced in colloquial
speech  (Chapter  4 and  Chapter  5). 

Note  also  that  unlike  Star  Trek’s  Commander  Data,  HAL  is  capable  of 
producing  contractions  like  I’m  and  can’t.  Producing  and  recognizing  these 
and other variations of individual words (for example recognizing that doors
is plural) requires knowledge about morphology, which captures information
about the shape and behavior of words in context (Chapter 2, Chapter 3).

Moving  beyond  individual  words,  HAL  must  know  how  to  analyze  the 
structure  underlying  Dave’s  request.  Such  an  analysis  is  necessary  among 
other  reasons  for  HAL  to  determine  that  Dave’s  utterance  is  a request  for 
action,  as  opposed  to  a simple  statement  about  the  world  or  a question  about 
the  door,  as  in  the  following  variations  of  his  original  statement. 

HAL,  the  pod  bay  door  is  open. 

HAL,  is  the  pod  bay  door  open? 

In  addition,  HAL  must  use  similar  structural  knowledge  to  properly  string 
together  the  words  that  constitute  its  response.  For  example,  HAL  must 
know  that  the  following  sequence  of  words  will  not  make  sense  to  Dave, 
despite  the  fact  that  it  contains  precisely  the  same  set  of  words  as  the  original. 

I’m I do, sorry that afraid Dave I’m can’t.

The  knowledge  needed  to  order  and  group  words  together  comes  under  the 
heading  of  syntax. 

Of  course,  simply  knowing  the  words  and  the  syntactic  structure  of 
what  Dave  said  does  not  tell  HAL  much  about  the  nature  of  his  request. 
To  know  that  Dave’s  command  is  actually  about  opening  the  pod  bay  door, 
rather  than  an  inquiry  about  the  day’s  lunch  menu,  requires  knowledge  of 
the  meanings  of  the  component  words,  the  domain  of  lexical  semantics, 
and  knowledge  of  how  these  components  combine  to  form  larger  meanings, 
compositional  semantics. 

Next, despite its bad behavior, HAL knows enough to be polite to
Dave. It could, for example, have simply replied No or No, I won’t open
the door. Instead, it first embellishes its response with the phrases I’m sorry
and I’m afraid, and then only indirectly signals its refusal by saying I can’t,
rather than the more direct (and truthful) I won’t.¹ The appropriate use of this
kind of polite and indirect language comes under the heading of pragmatics.

Finally, rather than simply ignoring Dave’s command and leaving the
door closed, HAL chooses to engage in a structured conversation relevant
to Dave’s initial request. HAL’s correct use of the word that in its answer
to Dave’s request is a simple illustration of the kind of between-utterance
device common in such conversations. Correctly structuring such conversations
requires knowledge of discourse conventions.

¹ For those unfamiliar with HAL, it is neither sorry nor afraid, nor is it incapable of opening
the door. It has simply decided in a fit of paranoia to kill its crew.

To  summarize,  the  knowledge  of  language  needed  to  engage  in  com- 
plex language  behavior  can  be  separated  into  six  distinct  categories. 

• Phonetics  and  Phonology  - The  study  of  linguistic  sounds. 

• Morphology  - The  study  of  the  meaningful  components  of  words. 

• Syntax  - The  study  of  the  structural  relationships  between  words. 

• Semantics  - The  study  of  meaning. 

• Pragmatics  - The  study  of  how  language  is  used  to  accomplish  goals. 

• Discourse  - The  study  of  linguistic  units  larger  than  a single  utterance. 


1.2  Ambiguity 

A perhaps surprising fact about the six categories of linguistic knowledge is
that most or all tasks in speech and language processing can be viewed as
resolving ambiguity at one of these levels. We say some input is ambiguous
if there are multiple alternative linguistic structures that can be built for it.
Consider the spoken sentence I made her duck. Here are five different meanings
this sentence could have (there are more), each of which exemplifies an
ambiguity at some level:

(1.1)  I cooked  waterfowl  for  her. 

(1.2)  I cooked  waterfowl  belonging  to  her. 

(1.3)  I created  the  (plaster?)  duck  she  owns. 

(1.4)  I caused  her  to  quickly  lower  her  head  or  body. 

(1.5)  I waved  my  magic  wand  and  turned  her  into  undifferentiated 
waterfowl. 

These different meanings are caused by a number of ambiguities. First, the
words duck and her are morphologically or syntactically ambiguous in their
part of speech. Duck can be a verb or a noun, while her can be a dative
pronoun or a possessive pronoun. Second, the word make is semantically
ambiguous; it can mean create or cook. Third, the verb make is syntactically
ambiguous in a different way. Make can be transitive, i.e. taking a
single direct object (1.2), or it can be ditransitive, i.e. taking two objects
(1.5), meaning that the first object (her) got made into the second object
(duck). Finally, make can take a direct object and a verb (1.4), meaning that
the object (her) got caused to perform the verbal action (duck). Furthermore,
in a spoken sentence, there is an even deeper kind of ambiguity; the first
word could have been eye or the second word maid.

We will often introduce the models and algorithms we present throughout
the book as ways to resolve these ambiguities. For example, deciding
whether duck is a verb or a noun can be solved by part of speech tagging.
Deciding whether make means ‘create’ or ‘cook’ can be solved by word
sense disambiguation. Deciding whether her and duck are part of the same
entity (as in (1.2)) or are different entities (as in (1.1) or (1.4)) can be solved
by probabilistic parsing. Ambiguities that don’t arise in this particular example
(like whether a given sentence is a statement or a question) will also
be resolved, for example by speech act interpretation.


1.3  Models  and  Algorithms 

One of the key insights of the last fifty years of research in language processing
is that the various kinds of knowledge described in the last sections
can be captured through the use of a small number of formal models, or theories.
Fortunately, these models and theories are all drawn from the standard
toolkits of Computer Science, Mathematics, and Linguistics and should be
generally familiar to those trained in those fields. Among the most important
elements in this toolkit are state machines, formal rule systems, logic, as
well as probability theory and other machine learning tools. These models,
in turn, lend themselves to a small number of algorithms from well-known
computational paradigms. Among the most important of these are
state space search algorithms and dynamic programming algorithms.

In  their  simplest  formulation,  state  machines  are  formal  models  that 
consist  of  states,  transitions  among  states,  and  an  input  representation.  Among 
the  variations  of  this  basic  model  that  we  will  consider  are  deterministic  and 
non-deterministic  finite-state  automata,  finite-state  transducers,  which 
can  write  to  an  output  device,  weighted  automata,  Markov  models  and 
hidden  Markov  models  which  have  a probabilistic  component. 
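
As a minimal sketch of the simplest case, a deterministic finite-state automaton can be written as a transition table plus a loop over the input. The particular language recognized here (a "b", at least two "a"s, then "!") and all names in the code are assumptions chosen for illustration.

```python
# Transition table for a deterministic finite-state automaton.
# Keys are (current state, input symbol); values are the next state.
# The language is hypothetical: "b", then two or more "a"s, then "!".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # loop: any number of further a's
    (3, "!"): 4,
}
START_STATE = 0
ACCEPT_STATES = {4}


def accepts(tape: str) -> bool:
    """Return True if the automaton ends in an accepting state on tape."""
    state = START_STATE
    for symbol in tape:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no legal transition: reject
        state = TRANSITIONS[key]
    return state in ACCEPT_STATES


print(accepts("baaa!"))
print(accepts("ba!"))
```

The states, the transitions among them, and the input string correspond directly to the three components named above; the weighted and probabilistic variants attach numbers to the entries of the table.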

Closely  related  to  these  somewhat  procedural  models  are  their  declar- 
ative counterparts:  formal  rule  systems.  Among  the  more  important  ones  we 
will  consider  are  regular  grammars  and  regular  relations,  context-free 
grammars,  feature-augmented  grammars,  as  well  as  probabilistic  vari- 
ants of  them  all.  State  machines  and  formal  rule  systems  are  the  main  tools 
used  when  dealing  with  knowledge  of  phonology,  morphology,  and  syntax. 

The algorithms associated with both state machines and formal rule
systems typically involve a search through a space of states representing hypotheses
about an input. Representative tasks include searching through a
space of phonological sequences for a likely input word in speech recognition,
or searching through a space of trees for the correct syntactic parse
of an input sentence. Among the algorithms that are often used for these
tasks are well-known graph algorithms such as depth-first search, as well
as heuristic variants such as best-first and A* search. The dynamic programming
paradigm is critical to the computational tractability of many of
these approaches because it ensures that redundant computations are avoided.
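
To give a feeling for why avoiding redundant computation matters, the sketch below counts the binary bracketings of an n-word sequence, a rough proxy for the size of the space of trees a parser might search. The function name and the choice of problem are assumptions made for illustration; the memoized recursion is a top-down form of dynamic programming, in which each subproblem is solved once and its result is reused.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_binary_trees(n_words: int) -> int:
    """Count the distinct binary bracketings of a sequence of n_words words."""
    if n_words < 1:
        raise ValueError("n_words must be at least 1")
    if n_words == 1:
        return 1
    # Split the sequence into a left part and a right part at every
    # possible point; the cache ensures each length is computed only once.
    return sum(
        count_binary_trees(left) * count_binary_trees(n_words - left)
        for left in range(1, n_words)
    )


for n in (2, 4, 6, 10):
    print(n, count_binary_trees(n))
```

Without the cache the same subsequence lengths would be recomputed over and over as n grows; with it, each length from 1 to n is evaluated exactly once.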

The third model that plays a critical role in capturing knowledge of
language is logic. We will discuss first order logic, also known as the predicate
calculus, as well as such related formalisms as feature-structures, semantic
networks, and conceptual dependency. These logical representations
have traditionally been the tool of choice when dealing with knowledge of
semantics, pragmatics, and discourse (although, as we will see, applications
in these areas are increasingly relying on the simpler mechanisms used in
phonology, morphology, and syntax).

Probability  theory  is  the  final  element  in  our  set  of  techniques  for  cap- 
turing linguistic  knowledge.  Each  of  the  other  models  (state  machines,  for- 
mal rule  systems,  and  logic)  can  be  augmented  with  probabilities.  One  major 
use  of  probability  theory  is  to  solve  the  many  kinds  of  ambiguity  problems 
that  we  discussed  earlier;  almost  any  speech  and  language  processing  prob- 
lem can  be  recast  as:  ‘given  N choices  for  some  ambiguous  input,  choose 
the  most  probable  one’. 
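
The "choose the most probable one" recipe can be sketched in a few lines. The probabilities below are made up purely for illustration (a real system would estimate them from a corpus), and the function name is an assumption.

```python
from typing import Mapping


def most_probable(candidates: Mapping[str, float]) -> str:
    """Return the candidate analysis with the highest probability."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda label: candidates[label])


# Hypothetical, invented probabilities for two of the ambiguities in
# "I made her duck": the part of speech of "duck" and the sense of "make".
duck_tags = {"noun": 0.7, "verb": 0.3}
make_senses = {"create": 0.4, "cook": 0.6}

print(most_probable(duck_tags))
print(most_probable(make_senses))
```

In practice the hard part is computing the probabilities, and doing so jointly over many interacting choices, rather than taking the maximum.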

Another major advantage of probabilistic models is that they are one of
a class of machine learning models. Machine learning research has focused
on ways to automatically learn the various representations described above:
automata, rule systems, search heuristics, classifiers. These systems can be
trained on large corpora and can be used as a powerful modeling technique,
especially in places where we don’t yet have good causal models. Machine
learning algorithms will be described throughout the book.


1.4  Language,  Thought,  and  Understanding 

To many, the ability of computers to process language as skillfully as we do
will signal the arrival of truly intelligent machines. The basis of this belief is
the fact that the effective use of language is intertwined with our general cognitive
abilities. Among the first to consider the computational implications
of this intimate connection was Alan Turing (1950). In this famous paper,
Turing introduced what has come to be known as the Turing Test. Turing
began with the thesis that the question of what it would mean for a machine
to think was essentially unanswerable due to the inherent imprecision in the
terms machine and think. Instead, he suggested an empirical test, a game,
in which a computer’s use of language would form the basis for determining
if it could think. If the machine could win the game it would be judged
intelligent.

In Turing’s game, there are three participants: two people and a computer.
One of the people is a contestant and plays the role of an interrogator. To
win, the interrogator must determine which of the other two participants is
the machine by asking a series of questions via a teletype. The task of the
machine is to fool the interrogator into believing it is a person by responding
as a person would to the interrogator’s questions. The task of the second
human participant is to convince the interrogator that the other participant is
the machine, and that they are human.

The  following  interaction  from  Turing’s  paper  illustrates  the  kind  of 
interactions  he  had  in  mind.  Clearly  a convincing  simulation  of  humanity 
doesn’t  require  being  an  expert  in  all  domains: 

Q:  Please  write  me  a sonnet  on  the  topic  of  the  Forth  Bridge. 

A: Count me out on this one. I never could write poetry.

Q: Add 34957 to 70764.

A:  (Pause  about  30  seconds  and  then  give  answer  as)  105621. 


Turing predicted that roughly by the end of the twentieth century a machine
with 10 gigabytes of memory would have around a 30% chance of fooling a
human interrogator after 5 minutes of questioning. Given the fact that you
can fool some of the people all the time, it is not clear how rigorous this
particular standard is. Nevertheless, the critical issue for Turing was that
using language as humans do is sufficient, by itself, as an operational test for
intelligence.

Surprisingly,  many  of  the  issues  raised  by  Turing  became  relevant  as 
early  as  1966  in  the  context  of  the  ELIZA  program  (Weizenbaum,  1966). 
ELIZA  was  an  early  natural  language  processing  system  capable  of  carry- 
ing on  a limited  form  of  conversation  with  a user.  Consider  the  following 
session  with  a version  of  ELIZA  that  imitated  the  responses  of  a Rogerian 
psychotherapist. 

User1: You are like my father in some ways.

ELIZA1: WHAT RESEMBLANCE DO YOU SEE

User2: You are not very aggressive but I think you don’t want me to notice that.

ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE

User3: You don’t argue with me.

ELIZA3: WHY DO YOU THINK I DON’T ARGUE WITH YOU

User4: You are afraid of me.

ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU

As  we  will  discuss  in  more  detail  in  Chapter  2,  ELIZA  is  a remarkably  sim- 
ple program  that  makes  use  of  pattern-matching  to  process  the  input  and 
translate  it  into  suitable  outputs.  The  success  of  this  simple  technique  in  this 
domain  is  due  to  the  fact  that  ELIZA  doesn’t  actually  need  to  know  anything 
to  mimic  a Rogerian  psychotherapist.  As  Weizenbaum  notes,  this  is  one  of 
the  few  dialogue  genres  where  the  listener  can  act  as  if  they  know  nothing  of 
the  world. 
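
A minimal sketch of this kind of pattern-matching is given below. The rules are hypothetical, loosely modeled on the session above; they are not Weizenbaum's original script, and the function names and fallback response are assumptions made for illustration.

```python
import re

# Hypothetical pattern/response rules, tried in order. A "{0}" in the
# response is filled with the text captured by the pattern.
RULES = [
    (re.compile(r"you are like (.*)", re.IGNORECASE),
     "WHAT RESEMBLANCE DO YOU SEE"),
    (re.compile(r"you are (.*)", re.IGNORECASE),
     "DOES IT PLEASE YOU TO BELIEVE I AM {0}"),
    (re.compile(r"you don['’]t (.*)", re.IGNORECASE),
     "WHY DO YOU THINK I DON'T {0}"),
]

# Swap first- and second-person words so the echoed text addresses the user.
PRONOUN_SWAPS = {"me": "you", "my": "your", "i": "you", "you": "me"}


def swap_pronouns(fragment: str) -> str:
    """Replace each pronoun in fragment with its counterpart."""
    words = fragment.lower().split()
    return " ".join(PRONOUN_SWAPS.get(word, word) for word in words)


def respond(utterance: str) -> str:
    """Return the response of the first rule whose pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(swap_pronouns(match.group(1)).upper())
    return "PLEASE GO ON"  # assumed fallback when nothing matches


print(respond("You are afraid of me."))
print(respond("You don't argue with me."))
```

Nothing in the program represents fathers, aggression, or fear; the responses are produced entirely by matching and rearranging the surface text.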

ELIZA’s deep relevance to Turing’s ideas is that many people who interacted
with ELIZA came to believe that it really understood them and their
problems. Indeed, Weizenbaum (1976) notes that many of these people continued
to believe in ELIZA’s abilities even after the program’s operation was
explained to them. In more recent years, Weizenbaum’s informal reports
have been repeated in a somewhat more controlled setting. Since 1991, an
event known as the Loebner Prize competition has attempted to put various
computer programs to the Turing test. Although these contests have proven
to have little scientific interest, a consistent result over the years has been
that even the crudest programs can fool some of the judges some of the time
(Shieber, 1994). Not surprisingly, these results have done nothing to quell
the ongoing debate over the suitability of the Turing test as a test for intelligence
among philosophers and AI researchers (Searle, 1980).

Fortunately,  for  the  purposes  of  this  book,  the  relevance  of  these  results 
does  not  hinge  on  whether  or  not  computers  will  ever  be  intelligent,  or  un- 
derstand natural  language.  Far  more  important  is  recent  related  research  in 
the  social  sciences  that  has  confirmed  another  of  Turing’s  predictions  from 
the  same  paper. 

Nevertheless  I believe  that  at  the  end  of  the  century  the  use  of 
words  and  educated  opinion  will  have  altered  so  much  that  we 
will  be  able  to  speak  of  machines  thinking  without  expecting  to 
be  contradicted. 

It is now clear that regardless of what people believe or know about the inner
workings of computers, they talk about them and interact with them as
social entities. People act toward computers as if they were people; they are
polite to them, treat them as team members, and expect among other things
that computers should be able to understand their needs, and be capable of
interacting with them naturally. For example, Reeves and Nass (1996) found
that when a computer asked a human to evaluate how well the computer had
been doing, the human gave more positive responses than when a different
computer asked the same questions. People seemed to be afraid of being impolite.
In a different experiment, Reeves and Nass found that people also
gave computers higher performance ratings if the computer had recently said
something flattering to the human. Given these predispositions, speech and
language-based systems may provide many users with the most natural interface
for many applications. This fact has led to a long-term focus in the
field on the design of conversational agents, artificial entities which communicate
conversationally.


1.5 The State of the Art and The Near-Term Future

We can only see a short distance ahead, but we can see plenty
there that needs to be done.

- Alan Turing.

This  is  an  exciting  time  for  the  field  of  speech  and  language  processing. 
The  recent  commercialization  of  robust  speech  recognition  systems,  and  the 
rise  of  the  World-Wide  Web,  have  placed  speech  and  language  processing 
applications  in  the  spotlight,  and  have  pointed  out  a plethora  of  exciting  pos- 
sible applications.  The  following  scenarios  serve  to  illustrate  some  current 
applications  and  near-term  possibilities. 

A Canadian computer program accepts daily weather data and generates
weather reports that are passed along unedited to the public in English
and French (Chandioux, 1976).

The  Babel  Fish  translation  system  from  Systran  handles  over  1,000,000 
translation  requests  a day  from  the  AltaVista  search  engine  site. 

A visitor  to  Cambridge,  Massachusetts,  asks  a computer  about  places 
to  eat  using  only  spoken  language.  The  system  returns  relevant  information 
from a database of facts about the local restaurant scene (Zue et al., 1991).

These scenarios represent just a few of the applications possible given
current technology. The following somewhat more speculative scenarios give
some feeling for applications currently being explored at research and
development labs around the world.

A computer  reads  hundreds  of  typed  student  essays  and  assigns  grades 
to them in a manner that is indistinguishable from human graders (Landauer
et al., 1997).

A satellite operator uses language to ask questions of, and issue commands
to, a computer that controls a world-wide network of satellites (?).

German  and  Japanese  entrepreneurs  negotiate  a time  and  place  to  meet 
in  their  own  languages  using  small  hand-held  communication  devices  (?). 

Closed-captioning is provided in any of a number of languages for
a broadcast  news  program  by  a computer  listening  to  the  audio  signal  (?). 

A computer equipped with a vision system watches a professional soccer
game and provides an automated natural language account of the game (?).

1.6  Some  Brief  History 

Historically, speech and language processing has been treated very
differently in computer science, electrical engineering, linguistics, and
psychology/cognitive science. Because of this diversity, speech and language
processing encompasses a number of different but overlapping fields in these
different departments: computational linguistics in linguistics, natural
language processing in computer science, speech recognition in electrical
engineering, computational psycholinguistics in psychology. This section
summarizes the different historical threads which have given rise to the field
of speech and language processing. This section will provide only a sketch;
the individual chapters will provide more detail on each area.

Foundational  Insights:  1940’s  and  1950’s 

The  earliest  roots  of  the  field  date  to  the  intellectually  fertile  period  just 
after  World  War  II  which  gave  rise  to  the  computer  itself.  This  period 
from  the  1940s  through  the  end  of  the  1950s  saw  intense  work  on  two 
foundational  paradigms:  the  automaton  and  probabilistic  or  information- 
theoretic  models. 

The automaton arose in the 1950s out of Turing’s (1936) model of
algorithmic computation, considered by many to be the foundation of modern
computer science. Turing’s work led to the McCulloch-Pitts neuron
(McCulloch and Pitts, 1943), a simplified model of the neuron as a kind of
computing element that could be described in terms of propositional logic,
and then to the work of Kleene (1951, 1956) on finite automata and regular
expressions. Shannon (1948) contributed to automata theory by applying
probabilistic models of discrete Markov processes to automata for language.
Drawing the idea of a finite-state Markov process from Shannon’s work,
Chomsky (1956) first considered finite-state machines as a way to
characterize a grammar, and defined a finite-state language as a language
generated by a finite-state grammar. These early models led to the field of
formal language theory, which used algebra and set theory to define formal
languages as sequences of symbols. This includes the context-free grammar,
first defined by Chomsky (1956) for natural languages but independently
discovered by Backus (1959) and Naur et al. (1960) in their descriptions of
the ALGOL programming language.

The second foundational insight of this period was the development of
probabilistic algorithms for speech and language processing, which dates to
Shannon’s other contribution: the metaphor of the noisy channel and
decoding for the transmission of language through media like communication
channels and speech acoustics. Shannon also borrowed the concept of
entropy from thermodynamics as a way of measuring the information capacity
of a channel, or the information content of a language, and performed the
first measure of the entropy of English using probabilistic techniques.

It was also during this early period that the sound spectrograph was
developed (Koenig et al., 1946), and foundational research was done in
instrumental phonetics that laid the groundwork for later work in speech
recognition. This led to the first machine speech recognizers in the early
1950’s. In 1952, researchers at Bell Labs built a statistical system that
could recognize any of the 10 digits from a single speaker (Davis et al.,
1952). The system had 10 speaker-dependent stored patterns roughly
representing the first two vowel formants in the digits. They achieved
97-99% accuracy by choosing the pattern which had the highest relative
correlation coefficient with the input.
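
The selection rule described for this early recognizer, picking the stored
pattern that correlates best with the input, can be sketched in a few lines
of Python. This is a toy illustration only, not a reconstruction of the Bell
Labs system: the function names, the made-up five-number patterns, and the use
of the Pearson correlation coefficient as the similarity measure are all
assumptions made for the example.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length number sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def recognize(observation, templates):
    """Return the label whose stored pattern correlates best with the input."""
    return max(templates,
               key=lambda label: correlation(observation, templates[label]))


# Hypothetical stored patterns, one per word (invented numbers, not real
# formant measurements).
TEMPLATES = {
    "one": [1.0, 3.0, 5.0, 4.0, 2.0],
    "two": [5.0, 4.0, 3.0, 2.0, 1.0],
    "three": [2.0, 2.0, 5.0, 1.0, 4.0],
}

# A noisy input that resembles the stored pattern for "one".
print(recognize([1.1, 2.9, 5.2, 3.8, 2.1], TEMPLATES))
```

Because the correlation coefficient ignores overall scale and offset, a
rule of this kind tolerates inputs that are louder or quieter than the stored
pattern, which may be one reason such a measure suits a single-speaker system.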

The  Two  Camps:  1957-1970 

By  the  end  of  the  1950s  and  the  early  1960s,  speech  and  language  processing 
had  split  very  cleanly  into  two  paradigms:  symbolic  and  stochastic. 

The symbolic paradigm took off from two lines of research. The first
was the work of Chomsky and others on formal language theory and
generative syntax throughout the late 1950’s and early to mid 1960’s, and the
work of many linguists and computer scientists on parsing algorithms,
initially top-down and bottom-up, and then via dynamic programming. One
of the earliest complete parsing systems was Zelig Harris’s Transformations
and Discourse Analysis Project (TDAP), which was implemented between
June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962).²
The second line of research was the new field of artificial intelligence. In
the summer of 1956 John McCarthy, Marvin Minsky, Claude Shannon, and
Nathaniel Rochester brought together a group of researchers for a two-month
workshop on what they decided to call artificial intelligence. Although AI
always included a minority of researchers focusing on stochastic and
statistical algorithms (including probabilistic models and neural nets), the
major focus of the new field was the work on reasoning and logic typified by
Newell and Simon’s work on the Logic Theorist and the General Problem
Solver. At this point early natural language understanding systems were
built. These were simple systems which worked in single domains mainly by a
combination of pattern matching and key-word search with simple heuristics
for reasoning and question-answering. By the late 1960’s more formal logical
systems were developed.

The stochastic paradigm took hold mainly in departments of statistics
and of electrical engineering. By the late 1950’s the Bayesian method was
beginning to be applied to the problem of optical character recognition.
Bledsoe and Browning (1959) built a Bayesian system for text recognition
that used a large dictionary and computed the likelihood of each observed
letter sequence given each word in the dictionary by multiplying the
likelihoods for each letter. Mosteller and Wallace (1964) applied Bayesian
methods to the problem of authorship attribution on The Federalist papers.
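
The likelihood computation described for the Bledsoe and Browning system can
be illustrated with a short Python sketch. Everything specific here is
hypothetical: the letter-confusion probabilities, the three-word dictionary,
and the function names are invented for the example, and a uniform prior over
dictionary words is assumed so that the best word is simply the one with the
highest likelihood.

```python
import math


def letter_likelihood(observed, intended):
    """Hypothetical P(observed letter | intended letter) over 26 letters."""
    return 0.9 if observed == intended else 0.1 / 25


def word_likelihood(observed, word):
    """Likelihood of an observed letter sequence given a dictionary word,
    computed by multiplying the per-letter likelihoods."""
    if len(observed) != len(word):
        return 0.0  # simplification: no insertions or deletions are modeled
    return math.prod(letter_likelihood(o, w) for o, w in zip(observed, word))


def best_word(observed, dictionary):
    """Return the dictionary word that makes the observation most likely."""
    return max(dictionary, key=lambda word: word_likelihood(observed, word))


DICTIONARY = ["cat", "dog", "dot"]  # toy stand-in for a large dictionary

# The misread sequence "dug" shares two letters with "dog", one with "dot".
print(best_word("dug", DICTIONARY))
```

A fuller Bayesian treatment would also multiply each likelihood by a prior
probability for the word; the uniform prior assumed here makes that factor
drop out of the comparison.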

The 1960s also saw the rise of the first serious testable psychological
models of human language processing based on transformational grammar,
as well as the first online corpora: the Brown corpus of American English,
a 1 million word collection of samples from 500 written texts from different
genres (newspaper, novels, non-fiction, academic, etc.), which was
assembled at Brown University in 1963-64 (Kucera and Francis, 1967; Francis,
1979; Francis and Kucera, 1982), and William S. Y. Wang’s 1967 DOC
(Dictionary on Computer), an on-line Chinese dialect dictionary.

² This system was reimplemented recently and is described by Joshi and Hopely (1999)
and Karttunen (1999), who note that the parser was essentially implemented as a cascade of
finite-state transducers.

Four  Paradigms:  1970-1983 

The next period saw an explosion in research in speech and language
processing, and the development of a number of research paradigms which still
dominate the field.

The stochastic paradigm played a huge role in the development of
speech recognition algorithms in this period, particularly the use of the
Hidden Markov Model and the metaphors of the noisy channel and decoding,
developed independently by Jelinek, Bahl, Mercer, and colleagues at IBM’s
Thomas J. Watson Research Center, and Baker at Carnegie Mellon
University, who was influenced by the work of Baum and colleagues at the
Institute for Defense Analyses in Princeton. AT&T’s Bell Laboratories was
also a center for work on speech recognition and synthesis; see (Rabiner and
Juang, 1993) for descriptions of the wide range of this work.

The  logic-based  paradigm  was  begun  by  the  work  of  Colmerauer  and 
his  colleagues  on  Q-systems  and  metamorphosis  grammars  (Colmerauer, 
1970,  1975),  the  forerunners  of  Prolog  and  Definite  Clause  Grammars  (Pereira 
and Warren, 1980). Independently, Kay’s (1979) work on functional
grammar, and shortly later, (1982)’s (1982) work on LFG, established the
importance of feature structure unification.

The natural language understanding field took off during this period,
beginning with Terry Winograd’s SHRDLU system which simulated a robot
embedded in a world of toy blocks (Winograd, 1972a). The program was
able to accept natural language text commands (Move the red block on top
of the smaller green one) of a hitherto unseen complexity and sophistication.
His system was also the first to attempt to build an extensive (for the time)
grammar of English, based on Halliday’s systemic grammar. Winograd’s
model made it clear that the problem of parsing was well-enough understood
to begin to focus on semantics and discourse models. Roger Schank and his
colleagues and students (in what was often referred to as the Yale School)
built a series of language understanding programs that focused on human
conceptual knowledge such as scripts, plans and goals, and human memory
organization (Schank and Abelson, 1977; Schank and Riesbeck, 1981;
Cullingford, 1981; Wilensky, 1983; Lehnert, 1977). This work often used
network-based semantics (Quillian, 1968; Norman and Rumelhart, 1975;
Schank, 1972; Wilks, 1975c, 1975b; Kintsch, 1974) and began to incorporate
Fillmore’s notion of case roles (Fillmore, 1968) into their representations
(Simmons, 1973a).

The logic-based and natural-language understanding paradigms were
unified in systems that used predicate logic as a semantic representation,
such as the LUNAR question-answering system (Woods, 1967, 1973).

The discourse modeling paradigm focused on four key areas in
discourse. Grosz and her colleagues proposed ideas of discourse structure and
discourse focus (Grosz, 1977a; Sidner, 1983a), a number of researchers
began to work on automatic reference resolution (Hobbs, 1978a), and the BDI
(Belief-Desire-Intention) framework for logic-based work on speech acts
was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979).

Empiricism  and  Finite  State  Models  Redux:  1983-1993 

This next decade saw the return of two classes of models which had lost
popularity in the late 50’s and early 60’s, partially due to theoretical
arguments against them such as Chomsky’s influential review of Skinner’s
Verbal Behavior (Chomsky, 1959b). The first class was finite-state models,
which began to receive attention again after work on finite-state phonology
and morphology by Kaplan and Kay (1981) and finite-state models of syntax by
Church (1980). A large body of work on finite-state models will be described
throughout the book.

The second trend in this period was what has been called the ‘return of
empiricism’; most notable here was the rise of probabilistic models
throughout speech and language processing, influenced strongly by the work at
the IBM Thomas J. Watson Research Center on probabilistic models of speech
recognition. These probabilistic methods and other such data-driven
approaches spread into part-of-speech tagging, parsing and attachment
ambiguities, and connectionist approaches from speech recognition to semantics.

This period also saw considerable work on natural language generation.

The  Field  Comes  Together:  1994-1999 

By the last five years of the millennium it was clear that the field was vastly
changing. First, probabilistic and data-driven models had become quite
standard throughout natural language processing. Algorithms for parsing,
part-of-speech tagging, reference resolution, and discourse processing all
began to incorporate probabilities, and employ evaluation methodologies
borrowed from speech recognition and information retrieval. Second, the
increases in the speed and memory of computers had allowed commercial
exploitation of a number of subareas of speech and language processing, in
particular speech recognition and spelling and grammar checking. Finally, the
rise of the Web emphasized the need for language-based information retrieval
and information extraction.

A Final  Brief  Note  on  Psychology 

Many of the chapters in this book include short summaries of psychological
research on human processing. Of course, understanding human language
processing is an important scientific goal in its own right, and is part of the
general field of cognitive science. However, an understanding of human
language processing can often be helpful in building better machine
models of language. This seems contrary to the popular wisdom, which holds
that direct mimicry of nature’s algorithms is rarely useful in engineering
applications. For example, the argument is often made that if we copied nature
exactly, airplanes would flap their wings; yet airplanes with fixed wings are a
more successful engineering solution. But language is not aeronautics.
Cribbing from nature is sometimes useful for aeronautics (after all, airplanes
do have wings), but it is particularly useful when we are trying to solve
human-centered tasks. Airplane flight has different goals than bird flight;
but the goal of speech recognition systems, for example, is to perform exactly
the task that human court reporters perform every day: transcribe spoken
dialog. Since people already do this well, we can learn from nature’s previous
solution. Since we are building speech recognition systems in order to
interact with people, it makes sense to copy a solution that behaves the way
people are accustomed to.

1.7  Summary 

This  chapter  introduces  the  field  of  speech  and  language  processing.  The 
following are some of the highlights of this chapter.

• A good way to understand the concerns of speech and language
processing research is to consider what it would take to create an
intelligent agent like HAL from 2001: A Space Odyssey.

• Speech and language technology relies on formal models, or
representations, of knowledge of language at the levels of phonology and
phonetics, morphology, syntax, semantics, pragmatics and discourse. A
small number of formal models including state machines, formal rule
systems, logic, and probability theory are used to capture this
knowledge.

• The foundations of speech and language technology lie in computer
science, linguistics, mathematics, electrical engineering and
psychology. A small number of algorithms from standard frameworks are used
throughout speech and language processing.

• The critical connection between language and thought has placed speech
and language processing technology at the center of debate over
intelligent machines. Furthermore, research on how people interact with
complex media indicates that speech and language processing
technology will be critical in the development of future technologies.

• Revolutionary applications of speech and language processing are
currently in use around the world. Recent advances in speech recognition
and the creation of the World-Wide Web will lead to many more
applications.


Bibliographical  and  Historical  Notes 

Research in the various subareas of speech and language processing is spread
across a wide number of conference proceedings and journals. The
conferences and journals most centrally concerned with computational
linguistics and natural language processing are associated with the
Association for Computational Linguistics (ACL), its European counterpart
(EACL), and the International Conference on Computational Linguistics
(COLING). The annual proceedings of ACL and EACL, and the biennial COLING
conference are the primary forums for work in this area. Related conferences
include the biennial conference on Applied Natural Language Processing (ANLP)
and the conference on Empirical Methods in Natural Language Processing
(EMNLP). The journal Computational Linguistics is the premier
publication in the field, although it has a decidedly theoretical and
linguistic orientation. The journal Natural Language Engineering covers more
practical applications of speech and language research.

Research on speech recognition, understanding, and synthesis is
presented at the biennial International Conference on Spoken Language
Processing (ICSLP) which alternates with the European Conference on Speech
Communication and Technology (EUROSPEECH). The IEEE International
Conference on Acoustics, Speech, & Signal Processing (IEEE ICASSP)
is held annually, as is the meeting of the Acoustical Society of America.
Speech journals include Speech Communication, Computer Speech and
Language, and IEEE Transactions on Pattern Analysis and Machine
Intelligence.

Work on language processing from an Artificial Intelligence
perspective can be found in the annual meetings of the American Association
for Artificial Intelligence (AAAI), as well as the biennial International
Joint Conference on Artificial Intelligence (IJCAI) meetings. The following
artificial intelligence publications periodically feature work on speech and
language processing: Artificial Intelligence, Computational Intelligence,
IEEE Transactions on Intelligent Systems, and the Journal of Artificial
Intelligence Research. Work on cognitive modeling of language can be found at
the annual meeting of the Cognitive Science Society, as well as its journal
Cognitive Science. An influential series of closed workshops was held by ARPA,
called variously the DARPA Speech and Natural Language Processing Workshop or
the ARPA Workshop on Human Language Technology.

There are a fair number of textbooks available covering various aspects
of speech and language processing. (Manning and Schütze, 1999)
(Foundations of Statistical Language Processing) focuses on statistical
models of tagging, parsing, disambiguation, collocations, and other areas.
Charniak (1993) (Statistical Language Learning) is an accessible, though less
extensive, introduction to similar material. Allen (1995) (Natural Language
Understanding) provides extensive coverage of language processing from the
AI perspective. (Gazdar and Mellish, 1989) (Natural Language Processing
in Lisp/Prolog) covers especially automata, parsing, features, and
unification. (Pereira and Shieber, 1987) gives a Prolog-based introduction to
parsing and interpretation. Russell and Norvig (1995) is an introduction to
artificial intelligence that includes chapters on natural language processing.
Partee (1990) has a very broad coverage of mathematical linguistics. (Cole,
1997) is a volume of survey papers covering the entire field of speech and
language processing. A somewhat dated but still tremendously useful
collection of foundational papers can be found in (Grosz et al., 1986)
(Readings in Natural Language Processing).

Of course, a wide variety of speech and language processing resources
are now available on the World-Wide Web. Pointers to these resources are
maintained on the homepage for this book at www.cs.colorado.edu/ martin/slp.html.


Part  I 

WORDS 


Words are the fundamental building block of language. Every human
language, spoken, signed, or written, is composed of words. Every
area of speech and language processing, from speech recognition to
machine translation to information retrieval on the web, requires
extensive knowledge about words. Psycholinguistic models of human
language processing and models from generative linguistics are also
heavily based on lexical knowledge.

The six chapters in this part introduce computational models
of the spelling, pronunciation, and morphology of words and cover
three important real-world tasks that rely on lexical knowledge:
automatic speech recognition (ASR), text-to-speech synthesis (TTS),
and spell-checking. Finally, these chapters define perhaps the most
important computational model for speech and language processing:
the automaton. Four kinds of automata are covered: finite-state
automata (FSAs) and regular expressions, finite-state transducers
(FSTs), weighted transducers, and the Hidden Markov Model (HMM),
as well as the N-gram model of word sequences.


REGULAR  EXPRESSIONS 
AND  AUTOMATA 


“In  the  old  days,  if  you  wanted  to  impeach  a witness  you  had  to 
go  back  and  fumble  through  endless  transcripts.  Now  it’s  on  a 
screen somewhere or on a disk and I can search for a particular
word  - say  every  time  the  witness  used  the  word  glove  - and  then 
quickly  ask  a question  about  what  he  said  years  ago.  Right  away 
you  see  the  witness  get  flustered.” 

Johnnie L. Cochran Jr., attorney, New York Times, 9/28/97


Imagine that you have become a passionate fan of woodchucks.
Desiring more information on this celebrated woodland creature, you turn to
your favorite web browser and type in woodchuck. Your browser returns a
few sites. You have a flash of inspiration and type in woodchucks. This time
you discover ‘interesting links to woodchucks and lemurs’ and ‘all about
Vermont’s unique, endangered species’. Instead of having to do this search
twice, you would have rather typed one search command specifying
something like woodchuck with an optional final s. Furthermore, you might want
to find a site whether or not it spelled woodchucks with a capital W
(Woodchuck). Or perhaps you might want to search for all the prices in some
document; you might want to see all strings that look like $199 or $25 or
$24.99. In this chapter we introduce the regular expression, the standard
notation for characterizing text sequences. The regular expression is used for
specifying text strings in situations like this web-search example, and in
other information retrieval applications, but also plays an important role in
word-processing (in PC, Mac, or UNIX applications), computation of frequencies
from corpora, and other such tasks.
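
As a preview, the searches imagined above can be written as regular
expressions and tried out with Python’s re module, whose notation for these
operators follows the Perl-style syntax used in this chapter. This is only a
sketch: the names WOODCHUCK, PRICE, and find_all and the sample sentences are
invented for illustration, and the operators used (brackets, the optional ?,
and the backslash before $) are explained later in the chapter.

```python
import re

# "woodchuck" with an optional final s, starting with either w or W.
WOODCHUCK = re.compile(r"[Ww]oodchucks?")

# A dollar sign, one or more digits, and optionally a period and two digits.
# The backslash makes $ a literal character; (?: ... )? marks an optional group.
PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")


def find_all(pattern, text):
    """Return every non-overlapping substring of text matched by pattern."""
    return pattern.findall(text)


print(find_all(WOODCHUCK, "Woodchuck facts: woodchucks and lemurs"))
print(find_all(PRICE, "It was $199, then $25, now $24.99."))
```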
After we have defined regular expressions, we show how they can be
implemented via the finite-state automaton. The finite-state automaton is
not only the mathematical device used to implement regular expressions, but
also one of the most significant tools of computational linguistics. Variations
of automata such as finite-state transducers, Hidden Markov Models, and
N-gram grammars are important components of the speech recognition and
synthesis, spell-checking, and information-extraction applications that we
will introduce in later chapters.


2.1  Regular  Expressions 

SIR  ANDREW  Her  C’s,  her  U’s  and  her  T’s:  why  that? 

Shakespeare,  Twelfth  Night 

One of the unsung successes in standardization in computer science
has been the regular expression (RE), a language for specifying text search
strings. The regular expression languages used for searching texts in UNIX
(vi, Perl, Emacs, grep), Microsoft Word (version 6 and beyond), and
WordPerfect are almost identical, and many RE features exist in the various
Web search engines. Besides this practical use, the regular expression is an
important theoretical tool throughout computer science and linguistics.

A regular expression (first developed by Kleene (1956) but see the
History section for more details) is a formula in a special language that is
used for specifying simple classes of strings. A string is a sequence of
symbols; for the purpose of most text-based search techniques, a string is any
sequence of alphanumeric characters (letters, numbers, spaces, tabs, and
punctuation). For these purposes a space is just a character like any other,
and we represent it with the symbol ␣.

Formally, a regular expression is an algebraic notation for
characterizing a set of strings. Thus regular expressions can be used to
specify search strings as well as to define a language in a formal way. We
will begin by talking about regular expressions as a way of specifying
searches in texts, and proceed to other uses. Section 2.3 shows that the use
of just three regular expression operators is sufficient to characterize
strings, but we use the more convenient and commonly-used regular expression
syntax of the Perl language throughout this section. Since common
text-processing programs agree on most of the syntax of regular expressions,
most of what we say extends to all UNIX, Microsoft Word, and WordPerfect
regular expressions. Appendix A shows the few areas where these programs
differ from the Perl syntax.

Regular expression search requires a pattern that we want to search
for, and a corpus of texts to search through. A regular expression search
function will search through the corpus returning all texts that contain the
pattern. In an information retrieval (IR) system such as a web search engine,
the texts might be entire documents or web pages. In a word-processor, the
texts might be individual words, or lines of a document. In the rest of this
chapter, we will use this last paradigm. Thus when we give a search pattern,
we will assume that the search engine returns each line of the document that
contains a match. This is what the UNIX ‘grep’ command does. We will
underline the exact part of the line that matches the regular expression. A
search can be designed to return all matches to a regular expression or only
the first match. We will show only the first match.
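
The line-oriented search just described can be sketched in Python. This is
a minimal illustration under stated assumptions rather than a description of
how grep itself is implemented: the function name grep_lines and the sample
corpus are invented here, and Python’s re.search is used because it reports
only the first match in each line.

```python
import re


def grep_lines(pattern, lines):
    """Return (line, start, end) for each line that contains the pattern.

    start and end delimit the first match in the line, i.e. the part of
    the line that would be underlined.
    """
    regex = re.compile(pattern)
    hits = []
    for line in lines:
        match = regex.search(line)  # first match only; None if no match
        if match:
            hits.append((line, match.start(), match.end()))
    return hits


CORPUS = [
    "I'm called little Buttercup",
    "dear little Buttercup, sweet little Buttercup",
    "though I could never tell why",
]

for line, start, end in grep_lines("Buttercup", CORPUS):
    print(line, "->", line[start:end])
```

Returning every match instead of the first would only require replacing the
call to search with a loop over regex.finditer(line).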

Basic  Regular  Expression  Patterns 

The simplest kind of regular expression is a sequence of simple characters.
For example, to search for woodchuck, we type /woodchuck/. So the
regular expression /Buttercup/ matches any string containing the substring
Buttercup, for example the line I’m called little Buttercup (recall that we
are assuming a search application that returns entire lines). From here on
we will put slashes around each regular expression to make it clear what is
a regular expression and what is a pattern. We use the slash since this is the
notation used by Perl, but the slashes are not part of the regular expressions.

The search string can consist of a single character (like /!/) or a sequence
of characters (like /urgl/). The first instance of each match to the regular
expression is underlined below (although a given application might choose to
return more than just the first instance):


RE               Example Patterns Matched
/woodchucks/     “interesting links to woodchucks and lemurs”
/a/              “Mary Ann stopped by Mona’s”
/Claire␣says,/   “Dagmar, my gift please,” Claire says,”
/song/           “all our pretty songs”
/!/              “You’ve left the burglar behind again!” said Nori
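
The search paradigm assumed here (return each line that contains a match, and report only the first match in it) can be sketched in a few lines of Python. This is only an illustration: the function name grep_first is hypothetical, and Python's re module stands in for Perl, with which it agrees on the simple patterns used so far.

```python
import re

def grep_first(pattern, lines):
    """Return (line, matched_text) for every line that contains a match.

    Only the first (leftmost) match in each line is reported.
    """
    regex = re.compile(pattern)
    hits = []
    for line in lines:
        m = regex.search(line)  # search() finds the leftmost match, if any
        if m:
            hits.append((line, m.group(0)))
    return hits

lines = [
    "interesting links to woodchucks and lemurs",
    "Mary Ann stopped by Mona's",
    "all our pretty songs",
]
results = grep_first(r"song", lines)
```

Because matching is case sensitive, a call such as grep_first(r"Woodchucks", lines) returns an empty list for these lines.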

Regular expressions are case sensitive; lower-case /s/ is distinct from
upper-case /S/ (/s/ matches a lower-case s but not an upper-case S). This
means that the pattern /woodchucks/ will not match the string Woodchucks.
We can solve this problem with the use of the square braces [ and ].
The string of characters inside the braces specifies a disjunction of
characters to match. For example, Figure 2.1 shows that the pattern /[wW]/
matches patterns containing either w or W.


RE               Match                    Example Patterns
/[wW]oodchuck/   Woodchuck or woodchuck   “Woodchuck”
/[abc]/          ‘a’, ‘b’, or ‘c’         “In uomini, in soldati”
/[1234567890]/   any digit                “plenty of 7 to 5”

Figure 2.1 The use of the brackets [] to specify a disjunction of characters.

The regular expression /[1234567890]/ specifies any single digit.
While classes of characters like digits or letters are important building
blocks in expressions, they can get awkward (e.g., it’s inconvenient to
specify

/[ABCDEFGHIJKLMNOPQRSTUVWXYZ]/

to mean ‘any capital letter’). In these cases the brackets can be used with
the dash (-) to specify any one character in a range. The pattern /[2-5]/
specifies any one of the characters 2, 3, 4, or 5. The pattern /[b-g]/
specifies one of the characters b, c, d, e, f, or g. Some other examples:


RE        Match                 Example Patterns Matched
/[A-Z]/   an uppercase letter   “we should call it ‘Drenched Blossoms’”
/[a-z]/   a lowercase letter    “my beans were impatient to be hoed!”
/[0-9]/   a single digit        “Chapter 1: Down the Rabbit Hole”

Figure 2.2 The use of the brackets [] plus the dash - to specify a range.
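
The bracket and range examples of Figures 2.1 and 2.2 can be checked directly. The sketch below assumes Python's re module as a stand-in for Perl and collects the first match of each pattern; the variable names are illustrative only.

```python
import re

# Each pair is (pattern, text); search() reports the first match only.
examples = [
    (r"[wW]oodchuck", "Woodchuck"),
    (r"[abc]", "In uomini, in soldati"),
    (r"[0-9]", "plenty of 7 to 5"),
    (r"[A-Z]", "we should call it 'Drenched Blossoms'"),
    (r"[a-z]", "my beans were impatient to be hoed!"),
]
first_matches = [re.search(p, text).group(0) for p, text in examples]
```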

The square braces can also be used to specify what a single character
cannot be, by use of the caret ^. If the caret ^ is the first symbol after
the open square brace [, the resulting pattern is negated. For example, the
pattern /[^a]/ matches any single character (including special characters)
except a. This is only true when the caret is the first symbol after the open
square brace. If it occurs anywhere else, it usually stands for a caret;
Figure 2.3 shows some examples.

The use of square braces solves our capitalization problem for woodchucks.
But we still haven’t answered our original question: how do we specify
both woodchuck and woodchucks? We can’t use the square brackets, because
while they allow us to say ‘s or S’, they don’t allow us to say
‘s or nothing’. For this we use the question-mark /?/, which means ‘the
preceding character or nothing’, as shown in Figure 2.4.

RE       Match (single characters)   Example Patterns Matched
[^A-Z]   not an uppercase letter     “Oyfn pripetchik”
[^Ss]    neither ‘S’ nor ‘s’         “I have no exquisite reason for’t”
[^\.]    not a period                “our resident Djinn”
[e^]     either ‘e’ or ‘^’           “look up ^ now”
a^b      the pattern ‘a^b’           “look up a^b now”

Figure 2.3 Uses of the caret ^ for negation or just to mean ^.

RE            Match                     Example Patterns Matched
woodchucks?   woodchuck or woodchucks   “woodchuck”
colou?r       color or colour           “colour”

Figure 2.4 The question-mark ? marks optionality of the previous expression.
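
A short sketch of negated classes and the optional marker, again assuming Python's re module in place of Perl. The word list is made up for illustration.

```python
import re

# A leading caret negates the class: the first non-uppercase character.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group(0)

# Anywhere else inside the brackets the caret is an ordinary character.
literal_caret = re.search(r"[e^]", "look up ^ now").group(0)

# ? makes the preceding character optional (zero or one occurrence).
words = ["woodchuck", "woodchucks", "woodchuckss", "color", "colour"]
plural_ok = [w for w in words if re.fullmatch(r"woodchucks?", w)]
spelling_ok = [w for w in words if re.fullmatch(r"colou?r", w)]
```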

We can think of the question-mark as meaning ‘zero or one instances
of the previous character’. That is, it’s a way of specifying how many of
something we want. So far we haven’t needed to specify that we want
more than one of something. But sometimes we need regular expressions
that allow repetitions of things. For example, consider the language of
(certain) sheep, which consists of strings that look like the following:

baa! 

baaa! 

baaaa! 

baaaaa! 

baaaaaa! 


This language consists of strings with a b, followed by at least two a’s,
followed by an exclamation point. The set of operators that allow us to say
things like “some number of a’s” are based on the asterisk or *, commonly
called the Kleene * (pronounced “cleany star”). The Kleene star means
‘zero or more occurrences of the immediately previous character or regular
expression’. So /a*/ means ‘any string of zero or more a’s’. This will
match a or aaaaaa but it will also match Off Minor, since the string Off
Minor has zero a’s. So the regular expression for matching one or more
a is /aa*/, meaning one a followed by zero or more a’s. More complex
patterns can also be repeated. So /[ab]*/ means ‘zero or more a’s or b’s’
(not ‘zero or more right square braces’). This will match strings like aaaa
or ababab or bbbb.

We now know enough to specify part of our regular expression for
prices: multiple digits. Recall that the regular expression for an individual
digit was /[0-9]/. So the regular expression for an integer (a string of
digits) is /[0-9][0-9]*/. (Why isn’t it just /[0-9]*/?)

Sometimes it’s annoying to have to write the regular expression for digits
twice, so there is a shorter way to specify ‘at least one’ of some character.
This is the Kleene +, which means ‘one or more of the previous character’.
Thus the expression /[0-9]+/ is the normal way to specify ‘a sequence of
digits’. There are thus two ways to specify the sheep language: /baaa*!/
or /baa+!/.
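
The two formulations of the sheep language can be compared on a few candidate strings. This is a minimal sketch using Python's re module (an assumption; the text uses Perl), with fullmatch() so that the whole string must belong to the language.

```python
import re

sheep_star = re.compile(r"baaa*!")  # b, two a's, then zero or more a's
sheep_plus = re.compile(r"baa+!")   # b, one a, then one or more a's

candidates = ["ba!", "baa!", "baaaaa!", "baa", "abaa!"]
accepted_star = [s for s in candidates if sheep_star.fullmatch(s)]
accepted_plus = [s for s in candidates if sheep_plus.fullmatch(s)]

# /a*/ can match zero a's, so it "matches" a text containing no a at all.
empty_match = re.search(r"a*", "Off Minor").group(0)

# /[0-9]+/ insists on at least one digit.
integer = re.search(r"[0-9]+", "costs 1999 dollars").group(0)
```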

One very important special character is the period (/./), a wildcard
expression that matches any single character (except a carriage return):


RE        Match                                 Example Patterns
/beg.n/   any character between ‘beg’ and ‘n’   begin, beg’n, begun

Figure 2.5 The use of the period . to specify any character.

The wildcard is often used together with the Kleene star to mean ‘any
string of characters’. For example, suppose we want to find any line in which
a particular word, for example aardvark, appears twice. We can specify this
with the regular expression /aardvark.*aardvark/.
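
A brief sketch of the wildcard, alone and combined with the Kleene star, assuming Python's re module. The sample lines are invented for illustration.

```python
import re

# .* between the two words allows any string of characters in between.
twice = re.compile(r"aardvark.*aardvark")
lines = ["the aardvark met another aardvark", "one aardvark only"]
hits = [line for line in lines if twice.search(line)]

# A single period stands for exactly one character.
beg_n = [w for w in ["begin", "beg'n", "begun", "begn"]
         if re.fullmatch(r"beg.n", w)]
```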

Anchors are special characters that anchor regular expressions to
particular places in a string. The most common anchors are the caret ^ and
the dollar-sign $. The caret ^ matches the start of a line. The pattern /^The/
matches the word The only at the start of a line. Thus there are three uses
of the caret ^: to match the start of a line, as a negation inside of square
brackets, and just to mean a caret. (What are the contexts that allow Perl to
know which function a given caret is supposed to have?) The dollar sign $
matches the end of a line. So the pattern /␣$/ is a useful pattern for
matching a space at the end of a line, and /^The dog\.$/ matches a line that
contains only the phrase The dog. (We have to use the backslash here since
we want the . to mean ‘period’ and not the wildcard.)

There are also two other anchors: \b matches a word boundary, while
\B matches a non-boundary. Thus /\bthe\b/ matches the word the but
not the word other. More technically, Perl defines a word as any sequence
of digits, underscores or letters; this is based on the definition of ‘words’
in programming languages like Perl or C. For example, /\b99/ will match
the string 99 in There are 99 bottles of beer on the wall (because 99 follows
a space) but not 99 in There are 299 bottles of beer on the wall (since 99
follows a number). But it will match 99 in $99 (since 99 follows a dollar
sign ($), which is not a digit, underscore, or letter).
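
The anchor examples above can be verified with a small table of checks. This sketch assumes Python's re module, whose ^, $, and \b behave like Perl's on these single-line strings; the helper name found is hypothetical.

```python
import re

def found(pattern, text):
    """True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

checks = {
    "caret at line start": found(r"^The", "The dog barked"),
    "caret elsewhere": found(r"^The", "See The dog"),
    "space at line end": found(r" $", "a trailing space "),
    "whole line": found(r"^The dog\.$", "The dog."),
    "the as a word": found(r"\bthe\b", "pet the cat"),
    "the inside other": found(r"\bthe\b", "other"),
    "99 after space": found(r"\b99", "There are 99 bottles"),
    "99 inside 299": found(r"\b99", "There are 299 bottles"),
    "99 after dollar": found(r"\b99", "costs $99"),
}
```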

Disjunction,  Grouping,  and  Precedence 

Suppose we need to search for texts about pets; perhaps we are particularly
interested in cats and dogs. In such a case we might want to search for either
the string cat or the string dog. Since we can’t use the square-brackets to
search for ‘cat or dog’ (why not?) we need a new operator, the disjunction
operator, also called the pipe symbol |. The pattern /cat|dog/ matches
either the string cat or the string dog.

Sometimes we need to use this disjunction operator in the midst of
a larger sequence. For example, suppose I want to search for information
about pet fish for my cousin David. How can I specify both guppy and
guppies? We cannot simply say /guppy|ies/, because that would match
only the strings guppy and ies. This is because sequences like guppy take
precedence over the disjunction operator |. In order to make the disjunction
operator apply only to a specific pattern, we need to use the parenthesis
operators ( and ). Enclosing a pattern in parentheses makes it act like a
single character for the purposes of neighboring operators like the pipe |
and the Kleene *. So the pattern /gupp(y|ies)/ would specify that we
meant the disjunction only to apply to the suffixes y and ies.

The parenthesis operator ( is also useful when we are using counters
like the Kleene *. Unlike the | operator, the Kleene * operator applies by
default only to a single character, not a whole sequence. Suppose we want
to match repeated instances of a string. Perhaps we have a line that has
column labels of the form Column 1 Column 2 Column 3. The expression
/Column␣[0-9]+␣*/ will not match any number of columns; instead, it will
match a single column followed by any number of spaces! The star here
applies only to the space ␣ that precedes it, not the whole sequence. With
the parentheses, we could write the expression /(Column␣[0-9]+␣*)*/ to
match the word Column, followed by a number and optional spaces, the whole
pattern repeated any number of times.
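
The effect of parentheses on the pipe and on the Kleene star can be seen in the following sketch, which assumes Python's re module and writes the visible space of the text as an ordinary space. The sample strings are illustrative.

```python
import re

text = "guppy guppies flies"

# Without grouping, the alternatives are the whole sequences guppy and ies.
ungrouped = re.findall(r"guppy|ies", text)

# With grouping, the disjunction applies to the suffix only.
grouped = [m.group(0) for m in re.finditer(r"gupp(y|ies)", text)]

header = "Column 1 Column 2 Column 3"

# The star applies only to the space, so a single column is matched.
one_column = re.match(r"Column [0-9]+ *", header).group(0)

# Grouping lets the star repeat the whole column pattern.
all_columns = re.fullmatch(r"(Column [0-9]+ *)*", header) is not None
```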

This idea that one operator may take precedence over another, requiring
us to sometimes use parentheses to specify what we mean, is formalized
by the operator precedence hierarchy for regular expressions. The following
table gives the order of RE operator precedence, from highest precedence
to lowest precedence:

Parenthesis             ()
Counters                * + ? {}
Sequences and anchors   the ^my end$
Disjunction             |

Thus, because counters have a higher precedence than sequences, /the*/
matches theeeee but not thethe. Because sequences have a higher precedence
than disjunction, /the|any/ matches the or any but not theny.

Patterns can be ambiguous in another way. Consider the expression
/[a-z]*/ when matching against the text once upon a time. Since /[a-z]*/
matches zero or more letters, this expression could match nothing, or
just the first letter o, or on, or onc, or once. In these cases regular
expressions always match the largest string they can; we say that patterns
are greedy, expanding to cover as much of a string as they can.
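
Both points, operator precedence and greedy matching, can be checked with a few calls. This is a sketch assuming Python's re module, whose precedence rules and default greediness agree with the description above.

```python
import re

# Counters bind more tightly than sequences: the star applies to e alone.
star_on_e = re.fullmatch(r"the*", "theeeee") is not None
star_on_seq = re.fullmatch(r"the*", "thethe") is not None
grouped_star = re.fullmatch(r"(the)*", "thethe") is not None

# Sequences bind more tightly than disjunction.
the_or_any = [w for w in ["the", "any", "theny"]
              if re.fullmatch(r"the|any", w)]

# Greedy matching: of the candidates "", o, on, onc, once the longest wins.
greedy = re.match(r"[a-z]*", "once upon a time").group(0)
```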

A simple  example 

Suppose  we  wanted  to  write  a RE  to  find  cases  of  the  English  article  the.  A 
simple  (but  incorrect)  pattern  might  be: 

/the/ 

One  problem  is  that  this  pattern  will  miss  the  word  when  it  begins 
a sentence  and  hence  is  capitalized  (i.e.  The).  This  might  lead  us  to  the 
following  pattern: 

/[tT]he/

But  we  will  still  incorrectly  return  texts  with  the  embedded  in  other 
words  (e.g.  other  or  theology ).  So  we  need  to  specify  that  we  want  instances 
with  a word  boundary  on  both  sides: 

/\b[tT]he\b/ 

Suppose we wanted to do this without the use of /\b/? We might
want this since /\b/ won’t treat underscores and numbers as word
boundaries; but we might want to find the in some context where it might
also have underlines or numbers nearby (the_ or the25). We need to specify
that we want instances in which there are no alphabetic letters on either
side of the the:

/[^a-zA-Z][tT]he[^a-zA-Z]/

But there is still one more problem with this pattern: it won’t find the
word the when it begins a line. This is because the regular expression
[^a-zA-Z], which we used to avoid embedded thes, implies that there must be
some single (although non-alphabetic) character before the the. We can avoid
this by specifying that before the the we require either the beginning-of-line
or a non-alphabetic character:

/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
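
The successive refinements of the pattern for the can be traced in code. The sketch below assumes Python's re module and writes the negated class as [^a-zA-Z], so that letters of either case are excluded; the test sentence is invented for illustration.

```python
import re

sentence = "The other theology books are on the shelf"

naive = re.findall(r"the", sentence)            # misses The, hits other
capital = re.findall(r"[tT]he", sentence)       # still hits other, theology
bounded = re.findall(r"\b[tT]he\b", sentence)   # whole words only

# \b treats _ and digits as word characters, so the_ is not found ...
boundary_miss = re.search(r"\b[tT]he\b", "the_ and the25") is None

# ... and requiring a non-letter on the left fails at the start of a line.
no_line_start = re.search(r"[^a-zA-Z][tT]he[^a-zA-Z]", "the_ cat") is None

# Allowing either beginning-of-line or a non-letter fixes both problems.
final = re.compile(r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]")
final_hit = final.search("the_ cat") is not None
```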

A More  Complex  Example 

Let’s  try  out  a more  significant  example  of  the  power  of  REs.  Suppose  we 
want  to  build  an  application  to  help  a user  buy  a computer  on  the  web.  The 
user  might  want  ‘any  PC  with  more  than  500  Mhz  and  32  Gb  of  disk  space 
for  less  than  $1000’.  In  order  to  do  this  kind  of  retrieval  we  will  first  need  to 
be  able  to  look  for  expressions  like  500  MHz  or  3.5  Gb  or  32  Megabytes,  or 
Compaq  or  Mac  or  $999.99.  In  the  rest  of  this  section  we'll  work  out  some 
simple  regular  expressions  for  this  task. 

First,  let’s  complete  our  regular  expression  for  prices.  Here’s  a regular 
expression  for  a dollar  sign  followed  by  a string  of  digits.  Note  that  Perl  is 
smart  enough  to  realize  that  $ here  doesn’t  mean  end-of-line;  how  might  it 
know  that? 

/$[0-9]+/

Now we just need to deal with fractions of dollars. We’ll add a decimal
point and two digits afterwards:

/$[0-9]+\.[0-9][0-9]/

This pattern only allows $199.99 but not $199. We need to make the
cents optional, and make sure we’re at a word boundary:

/\b$[0-9]+(\.[0-9][0-9])?\b/

How about specifications for processor speed (in Megahertz = Mhz or
Gigahertz = Ghz)? Here’s a pattern for that:

/\b[0-9]+␣*(Mhz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/

Note that we use /␣*/ to mean ‘zero or more spaces’, since there
might always be extra spaces lying around. Dealing with disk space (in Gb
= gigabytes), or memory size (in Mb = megabytes or Gb = gigabytes), we
need to allow for optional gigabyte fractions again (5.5 Gb). Note the use of
? for making the final s optional:

/\b[0-9]+␣*(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)?␣*(Gb|[Gg]igabytes?)\b/

Finally, we might want some simple patterns to specify operating
systems and vendors:

/\b(Win|Win95|Win98|WinNT|Windows␣*(NT|95|98)?)\b/
/\b(Mac|Macintosh|Apple)\b/
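
The patterns of this section can be tried on a sample advertisement. The following is a sketch under two stated assumptions rather than a transcription of the Perl patterns: Python's re module requires the dollar sign to be escaped, and the leading \b of the price pattern is omitted here, because no word boundary exists between a space and a dollar sign. The sample text and variable names are invented.

```python
import re

# Price: escaped dollar sign, digits, optional cents, then a word boundary.
price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
# Processor speed: a number, optional spaces, and a unit.
speed = re.compile(r"\b[0-9]+ *(Mhz|[Mm]egahertz|Ghz|[Gg]igahertz)\b")
# Disk space: an optional fraction and an optional plural s.
disk = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

ad = "PC with 500 Mhz processor and 5.5 Gb disk for $999.99"

found_price = price.search(ad).group(0)
found_speed = speed.search(ad).group(0)
found_disk = disk.search(ad).group(0)
whole_dollars = price.search("costs $199 today").group(0)
```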

Advanced  Operators 


RE   Expansion      Match                          Example Patterns
\d   [0-9]          any digit                      Party␣of␣5
\D   [^0-9]         any non-digit                  Blue␣moon
\w   [a-zA-Z0-9_]   any alphanumeric or underscore Daiyu
\W   [^\w]          a non-alphanumeric             !!!!
\s   [␣\r\t\n\f]    whitespace (space, tab)
\S   [^\s]          Non-whitespace                 in␣Concord

Figure 2.6 Aliases for common sets of characters.

There are also some useful advanced regular expression operators.
Figure 2.6 shows some useful aliases for common ranges, which can be used
mainly to save typing. Besides the Kleene * and Kleene +, we can also use
explicit numbers as counters, by enclosing them in curly brackets. The
regular expression /{3}/ means “exactly 3 occurrences of the previous
character or expression”. So /a\.{24}z/ will match a followed by 24 dots
followed by z (but not a followed by 23 or 25 dots followed by a z).

A range of numbers can also be specified; so /{n,m}/ specifies from
n to m occurrences of the previous char or expression, while /{n,}/ means
at least n occurrences of the previous expression. REs for counting are
summarized in Figure 2.7.
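
Explicit counters and the character-set aliases can be exercised as follows. This is a sketch assuming Python's re module, which supports the same {n}, {n,m}, and {n,} counters; the helper dotted is hypothetical.

```python
import re

def dotted(n):
    """Build the string a, n dots, z."""
    return "a" + "." * n + "z"

# {24} demands exactly 24 occurrences of the escaped period.
exactly_24 = [n for n in (23, 24, 25)
              if re.fullmatch(r"a\.{24}z", dotted(n))]

# {2,4} allows from two to four digits; {3,} allows three or more a's.
two_to_four = [s for s in ["1", "12", "1234", "12345"]
               if re.fullmatch(r"[0-9]{2,4}", s)]
at_least_three = [s for s in ["aa", "aaa", "aaaaa"]
                  if re.fullmatch(r"a{3,}", s)]

# \d is shorthand for [0-9].
digits = re.findall(r"\d", "Party of 5")
```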

Finally, certain special characters are referred to by special notation
based on the backslash (\). The most common of these are the newline
character \n and the tab character \t. To refer to characters that are special
themselves (like ., *, [, and \), precede them with a backslash (i.e., /\./,
/\*/, /\[/, and /\\/).

RE      Match
*       zero or more occurrences of the previous char or expression
+       one or more occurrences of the previous char or expression
?       exactly zero or one occurrence of the previous char or expression
{n}     n occurrences of the previous char or expression
{n,m}   from n to m occurrences of the previous char or expression
{n,}    at least n occurrences of the previous char or expression

Figure 2.7 Regular expression operators for counting.

RE   Match             Example Patterns Matched
\*   an asterisk       “K*A*P*L*A*N”
\.   a period          “Dr. Livingston, I presume”
\?   a question mark   “Would you light my candle?”
\n   a newline
\t   a tab

Figure 2.8 Some characters that need to be backslashed.

The  reader  should  consult  Appendix  A for  further  details  of  regular 
expressions,  and  especially  for  the  differences  between  regular  expressions 
in  Perl,  UNIX,  and  Microsoft  Word. 

Regular  Expression  Substitution,  Memory,  and  ELIZA 

An important use of regular expressions is in substitutions. For example, the
Perl substitution operator s/regexp1/regexp2/ allows a string characterized
by one regular expression to be replaced by a string characterized by
a different regular expression:

s/colour/color/

It is often useful to be able to refer to a particular subpart of the string
matching the first pattern. For example, suppose we wanted to put angle
brackets around all integers in a text, changing e.g. the 35 boxes to the
<35> boxes. We’d like a way to refer back to the integer we’ve found so
that we can easily add the brackets. To do this, we put parentheses ( and
) around the first pattern, and use the number operator \1 in the second
pattern to refer back. Here’s how it looks:

s/([0-9]+)/<\1>/

The parenthesis and number operators can also be used to specify that
a certain string or expression must occur twice in the text. For example,
suppose we are looking for the pattern ‘the Xer they were, the Xer they will
be’, where we want to constrain the two X’s to be the same string. We do
this by surrounding the first X with the parenthesis operator, and replacing
the second X with the number operator \1, as follows:

/the (.*)er they were, the \1er they will be/

Here the \1 will be replaced by whatever string matched the first item in
parentheses. So this will match the bigger they were, the bigger they will be
but not the bigger they were, the faster they will be.

The number operator can be used with other numbers: if you match
two different sets of parentheses, \2 means whatever matched the second
set. For example

/the (.*)er they (.*), the \1er they \2/

will match the bigger they were, the bigger they were but not the bigger they
were, the bigger they will be. These numbered memories are called registers
(e.g. register 1, register 2, register 3, etc). This memory feature is not part
of every regular expression language, and is often considered an ‘extended’
feature of regular expressions.
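
Substitution and registers carry over to Python almost unchanged. In the sketch below, re.sub plays the role of Perl's s/.../.../ operator (an assumption about the reader's toolkit, not part of the text), and \1 and \2 refer back to the parenthesized groups exactly as described.

```python
import re

# Plain substitution, as in s/colour/color/.
american = re.sub(r"colour", "color", "my favourite colour")

# Register \1 in the replacement refers back to the matched integer.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# Register \1 inside the pattern forces the two X's to be the same string.
same_x = re.compile(r"the (.*)er they were, the \1er they will be")
match_same = same_x.search(
    "the bigger they were, the bigger they will be") is not None
match_diff = same_x.search(
    "the bigger they were, the faster they will be") is not None

# Two registers: \2 must repeat whatever the second group matched.
two_regs = re.compile(r"the (.*)er they (.*), the \1er they \2")
match_were = two_regs.search(
    "the bigger they were, the bigger they were") is not None
match_will = two_regs.search(
    "the bigger they were, the bigger they will be") is not None
```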

Substitutions using memory are very useful in implementing a simple
natural-language understanding program like ELIZA (Weizenbaum, 1966).
Recall that ELIZA simulated a Rogerian psychologist, and could carry on
conversations with the user like the following:

User1:  Men are all alike.
ELIZA1: IN WHAT WAY
User2:  They’re always bugging us about something or other.
ELIZA2: CAN YOU THINK OF A SPECIFIC EXAMPLE
User3:  Well, my boyfriend made me come here.
ELIZA3: YOUR BOYFRIEND MADE YOU COME HERE
User4:  He says I’m depressed much of the time.
ELIZA4: I AM SORRY TO HEAR YOU ARE DEPRESSED.

Eliza worked by having a cascade of regular expression substitutions
that each matched some part of the input lines and changed them. The first
substitutions changed all instances of my to YOUR, and I’m to YOU ARE,
and so on. The next set of substitutions looked for relevant patterns in the
input and created an appropriate output; here are some examples:

s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/

Since multiple substitutions could apply to a given input, substitutions
were assigned a rank and were applied in order. Creation of such patterns is
addressed in Exercise 2.2.
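
The cascade can be sketched in a few lines. This is not Weizenbaum's program: the rule lists, the function name respond, and the choice to upper-case the reply are assumptions made for illustration, and Python's re.subn stands in for Perl's substitution operator. Rank is modeled simply by list order, with the first applicable response rule winning.

```python
import re

# Stage 1: swap first-person forms for second-person ones.
PRONOUN_RULES = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Stage 2: ranked response patterns; earlier rules take priority.
RESPONSE_RULES = [
    (r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(line):
    """Apply the pronoun rules, then the first response rule that matches."""
    for pattern, replacement in PRONOUN_RULES:
        line = re.sub(pattern, replacement, line)
    for pattern, replacement in RESPONSE_RULES:
        reply, count = re.subn(pattern, replacement, line)
        if count:
            return reply.upper()
    return line.upper()  # no rule applied: echo the transformed input

reply_all = respond("Men are all alike.")
reply_always = respond("They're always bugging us about something or other.")
reply_depressed = respond("He says I'm depressed much of the time.")
```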

2.2 Finite-State Automata


The regular expression is more than just a convenient metalanguage for text
searching. First, a regular expression is one way of describing a finite-state
automaton (FSA). Finite-state automata are the theoretical foundation of
a good deal of the computational work we will describe in this book. Any
regular expression can be implemented as a finite-state automaton (except
regular expressions that use the memory feature; more on this later).
Symmetrically, any finite-state automaton can be described with a regular
expression. Second, a regular expression is one way of characterizing a
particular kind of formal language called a regular language. Both regular
expressions and finite-state automata can be used to describe regular
languages. The relation among these three theoretical constructions is
sketched out in Figure 2.9.


[Figure: a diagram linking regular expressions, finite automata, and
regular languages.]

Figure 2.9 The relationship between finite automata, regular expressions,
and regular languages; figure suggested by Martin Kay.


This section will begin by introducing finite-state automata for some of
the regular expressions from the last section, and then suggest how the
mapping from regular expressions to automata proceeds in general. Although
we begin with their use for implementing regular expressions, FSAs have a
wide variety of other uses which we will explore in this chapter and the next.

Using  an  FSA  to  Recognize  Sheeptalk 

After a while, with the parrot’s help, the Doctor got to learn the
language of the animals so well that he could talk to them himself
and understand everything they said.

Hugh Lofting, The Story of Doctor Dolittle

Let’s  begin  with  the  ‘sheep  language’  we  discussed  previously.  Recall 
that  we  defined  the  sheep  language  as  any  string  from  the  following  (infinite) 
set: 


baa! 

baaa! 

baaaa! 

baaaaa! 

baaaaaa! 


The regular expression for this kind of ‘sheep talk’ is /baa+!/.
Figure 2.10 shows an automaton for modeling this regular expression. The
automaton (i.e., machine, also called finite automaton, finite-state
automaton, or FSA) recognizes a set of strings, in this case the strings
characterizing sheep talk, in the same way that a regular expression does.
We represent the automaton as a directed graph: a finite set of vertices
(also called nodes), together with a set of directed links between pairs of
vertices called arcs. We’ll represent vertices with circles and arcs with
arrows. The automaton has five states, which are represented by nodes in the
graph. State 0 is the start state which we represent by the incoming arrow.
State 4 is the final state or accepting state, which we represent by the
double circle. It also has five transitions, which we represent by arcs in
the graph.

The FSA can be used for recognizing (we also say accepting) strings
in the following way. First, think of the input as being written on a long tape
broken up into cells, with one symbol written in each cell of the tape, as in
Figure 2.11.

[Figure: an input tape divided into cells holding the symbols a b a ! b.]

Figure 2.11 A tape with cells.

The machine starts in the start state (q0), and iterates the following
process: Check the next letter of the input. If it matches the symbol on
an arc leaving the current state, then cross that arc, move to the next state,
and also advance one symbol in the input. If we are in the accepting state
(q4) when we run out of input, the machine has successfully recognized an
instance of sheeptalk. If the machine never gets to the final state, either
because it runs out of input, or it gets some input that doesn’t match an arc
(as in Figure 2.11), or if it just happens to get stuck in some non-final state,
we say the machine rejects or fails to accept an input.

We can also represent an automaton with a state-transition table. As
in the graph notation, the state-transition table represents the start state, the
accepting states, and what transitions leave each state with which symbols.
Here’s the state-transition table for the FSA of Figure 2.10.

             Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅

Figure 2.12 The state-transition table for the FSA of Figure 2.10.

We’ve marked state 4 with a colon to indicate that it’s a final state (you
can have as many final states as you want), and the ∅ indicates an illegal or
missing transition. We can read the first row as “if we’re in state 0 and we
see the input b we must go to state 1. If we’re in state 0 and we see the input
a or !, we fail”.

More formally, a finite automaton is defined by the following 5 parameters:

• Q: a finite set of N states q0, q1, ..., qN-1

• Σ: a finite input alphabet of symbols

• q0: the start state

• F: the set of final states, F ⊆ Q

• δ(q,i): the transition function or transition matrix between states. Given
  a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns a new state
  q′ ∈ Q. δ is thus a relation from Q × Σ to Q.

For the sheeptalk automaton in Figure 2.10, Q = {q0, q1, q2, q3, q4},
Σ = {a, b, !}, F = {q4}, and δ(q,i) is defined by the transition table in
Figure 2.12.

Figure 2.13 presents an algorithm for recognizing a string using a
state-transition table. The algorithm is called D-RECOGNIZE for
‘deterministic recognizer’. A deterministic algorithm is one that has no
choice points; the algorithm always knows what to do for any input. The next
section will introduce non-deterministic automata that must make decisions
about which states to move to.

D-RECOGNIZE takes as input a tape and an automaton. It returns accept
if the string it is pointing to on the tape is accepted by the automaton,
and reject otherwise. Note that since D-RECOGNIZE assumes it is already
pointing at the string to be checked, its task is only a subpart of the general
problem that we often use regular expressions for, finding a string in a corpus
(the general problem is left as an exercise to the reader in Exercise 2.8).

D-RECOGNIZE begins by initializing the variables index and current-state
to the beginning of the tape and the machine’s initial state. D-RECOGNIZE
then enters a loop that drives the rest of the algorithm. It first checks whether
it has reached the end of its input. If so, it either accepts the input (if the
current state is an accept state) or rejects the input (if not).

If there is input left on the tape, D-RECOGNIZE looks at the transition
table to decide which state to move to. The variable current-state indicates
which row of the table to consult, while the current symbol on the tape
indicates which column of the table to consult. The resulting transition-table
cell is used to update the variable current-state and index is incremented to
move forward on the tape. If the transition-table cell is empty then the
machine has nowhere to go and must reject the input.

function D-RECOGNIZE(tape, machine) returns accept or reject

  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end

Figure 2.13 An algorithm for deterministic recognition of FSAs. This
algorithm returns accept if the entire string it is pointing at is in the
language defined by the FSA, and reject if the string is not in the language.
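
The pseudocode of Figure 2.13 translates almost line for line into Python. The sketch below makes some representational assumptions that are not in the text: the tape is a Python string, the transition table is a dictionary keyed by (state, symbol) pairs, a missing key plays the role of an empty cell, and the function returns True for accept and False for reject.

```python
# Transition table for the sheeptalk FSA; absent keys are empty cells.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
SHEEP_START = 0
SHEEP_ACCEPT = {4}

def d_recognize(tape, table, start_state, accept_states):
    """Return True (accept) if the whole tape is in the FSA's language."""
    index = 0
    current_state = start_state
    while True:
        if index == len(tape):                    # end of input reached
            return current_state in accept_states
        symbol = tape[index]
        if (current_state, symbol) not in table:  # empty cell: reject
            return False
        current_state = table[(current_state, symbol)]
        index += 1

accepted = [s for s in ["baaa!", "baa!", "ba!", "abc", "baaa", ""]
            if d_recognize(s, SHEEP_TABLE, SHEEP_START, SHEEP_ACCEPT)]
```

Keying the dictionary by (state, symbol) pairs keeps the lookup identical in shape to transition-table[current-state, tape[index]] in the pseudocode.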


Figure  2.14  traces  the  execution  of  this  algorithm  on  the  sheep  lan- 
guage FSA  given  the  sample  input  string  baaa!. 


[Figure: the input tape b a a a ! with the machine’s state shown at each
position: q0, q1, q2, q3, q3, q4.]

Figure  2.14  Tracing  the  execution  of  FSA  #1  on  some  sheeptalk. 

Before examining the beginning of the tape, the machine is in state q0.
Finding a b on the input tape, it changes to state q1 as indicated by the contents
of transition-table[q0,b] in Figure 2.12 on page 35. It then finds an a and
switches to state q2, another a puts it in state q3, a third a leaves it in state q3,
where it reads the !, and switches to state q4. Since there is no more input,
the End of input condition at the beginning of the loop is satisfied for
the first time and the machine halts in q4. State q4 is an accepting state,
and so the machine has accepted the string baaa! as a sentence in the sheep
language.

The algorithm will fail whenever there is no legal transition for a given
combination of state and input. The input abc will fail to be recognized since
there is no legal transition out of state q0 on the input a (i.e. this entry of
the transition table in Figure 2.12 on page 35 has a ∅). Even if the automaton
had allowed an initial a it would have certainly failed on c, since c isn’t even
in the sheeptalk alphabet! We can think of these ‘empty’ elements in the
table as if they all pointed at one ‘empty’ state, which we might call the fail
state or sink state. In a sense then, we could view any machine with empty
transitions as if we had augmented it with a fail state, and drawn in all the
extra arcs, so we always had somewhere to go from any state on any possible
input. Just for completeness, Figure 2.15 shows the FSA from Figure 2.10
with the fail state qF filled in.
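
The idea of filling in a fail state can be made concrete with a small sketch. We assume the dictionary-based table used for illustration here; the function name add_fail_state and the label "qF" are hypothetical.

```python
def add_fail_state(table, alphabet, fail="qF"):
    """Return a copy of `table` in which every empty cell points at `fail`."""
    total = {}
    for state, row in table.items():
        total[state] = {sym: row.get(sym, fail) for sym in alphabet}
    # The fail state loops to itself on every symbol: once there, never leave.
    total[fail] = {sym: fail for sym in alphabet}
    return total


SHEEP_TABLE = {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}, 4: {}}
complete = add_fail_state(SHEEP_TABLE, alphabet="ab!")

print(complete[0])     # {'a': 'qF', 'b': 1, '!': 'qF'}
print(complete["qF"])  # {'a': 'qF', 'b': 'qF', '!': 'qF'}
```

Symbols outside the alphabet, such as c, still have no column at all, so a recognizer built on this table must reject them separately.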


Formal  Languages 

We can use the same graph in Figure 2.10 as an automaton for GENERATING
sheeptalk. If we do, we would say that the automaton starts at state q0, and
crosses arcs to new states, printing out the symbols that label each arc it
follows. When the automaton gets to the final state it stops. Notice that at
state 3, the automaton has to choose between printing out a ! and going to
state 4, or printing out an a and returning to state 3. Let’s say for now that
we don’t care how the machine makes this decision; maybe it flips a coin.
For now, we don’t care which exact string of sheeptalk we generate, as long
as it’s a string captured by the regular expression for sheeptalk above.
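
Generation can be sketched as a random walk over the arcs. The arc-list representation and the names SHEEP_ARCS and generate are assumptions made for this example; the coin flip is played by a random choice among the arcs leaving the current state.

```python
import random

# Each state maps to its outgoing arcs as (symbol, destination) pairs.
SHEEP_ARCS = {
    0: [("b", 1)],
    1: [("a", 2)],
    2: [("a", 3)],
    3: [("a", 3), ("!", 4)],
    4: [],
}
FINAL_STATE = 4


def generate(arcs, start=0, final=FINAL_STATE, rng=random):
    """Walk from `start` to `final`, emitting the label of each arc followed."""
    state, output = start, []
    while state != final:
        symbol, state = rng.choice(arcs[state])   # the "coin flip"
        output.append(symbol)
    return "".join(output)


print(generate(SHEEP_ARCS))  # some sheeptalk string, e.g. baa! or baaaa!
```

Every string produced this way is one that the recognizer for the same graph accepts.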

Key Concept #1. Formal Language: A model which can both generate
and recognize all and only the strings of a formal language acts as
a definition of the formal language.

A formal language is a set of strings, each string composed of symbols
from a finite symbol-set called an alphabet (the same alphabet used above
for defining an automaton!). The alphabet for the sheep language is the set
Σ = {a, b, !}. Given a model m (such as a particular FSA), we can use L(m)
to mean “the formal language characterized by m”. So the formal language
defined by our sheeptalk automaton m in Figure 2.10 (and Figure 2.12) is the
infinite set:

L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaa!, ...}    (2.1)

The usefulness of an automaton for defining a language is that it can
express an infinite set (such as this one above) in a closed form. Formal
languages are not the same as natural languages, which are the kind of
languages that real people speak. In fact a formal language may bear no
resemblance at all to a real language (for example a formal language can be
used to model the different states of a soda machine). But we often use a
formal language to model part of a natural language, such as parts of the
phonology, morphology, or syntax. The term generative grammar is sometimes
used in linguistics to mean a grammar of a formal language; the origin
of the term is this use of an automaton to define a language by generating all
possible strings.

Another  Example 

In  the  previous  examples  our  formal  alphabet  consisted  of  letters;  but  we 
can  also  have  a higher-level  alphabet  consisting  of  words.  In  this  way  we 
can  write  finite-state  automata  that  model  facts  about  word  combinations. 

For  example,  suppose  we  wanted  to  build  an  FSA  that  modeled  the  subpart 
of  English  dealing  with  amounts  of  money.  Such  a formal  language  would 
model the subset of English consisting of phrases like ten cents, three
dollars, one dollar thirty-five cents and so on.

We might break this down by first building just the automaton to
account for the numbers from one to ninety-nine, since we’ll need them to deal
with cents. Figure 2.16 shows this.


[Figure: an FSA in which arcs labeled one through nineteen and the tens
words twenty through ninety lead to the final state, and a second set of
arcs labeled twenty through ninety leads to an intermediate state that is
followed by arcs labeled one through nine.]

Figure 2.16 An FSA for the words for English numbers 1-99.
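
A word-level automaton of this kind can be sketched with the same table-driven recognizer, using words rather than letters as the alphabet. The sketch below is a deterministic variant of our own rather than a transcription of Figure 2.16: the state reached after a tens word is itself accepting, so that twenty and twenty-one are both recognized without any choice point. All names are hypothetical.

```python
ONES = "one two three four five six seven eight nine".split()
TEN_TO_NINETEEN = ("ten eleven twelve thirteen fourteen fifteen "
                   "sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()


def build_number_table():
    """Word-level transition table for the numbers one to ninety-nine."""
    table = {0: {}, 1: {}, 2: {}}
    for word in ONES + TEN_TO_NINETEEN:
        table[0][word] = 2      # one ... nineteen: a complete number
    for word in TENS:
        table[0][word] = 1      # twenty ... ninety: may stand alone
    for word in ONES:
        table[1][word] = 2      # ... or be followed by one ... nine
    return table


NUMBER_TABLE = build_number_table()
NUMBER_ACCEPT = {1, 2}


def accepts_number(phrase):
    """Run the deterministic recognizer over the words of `phrase`."""
    state = 0
    for word in phrase.replace("-", " ").split():
        state = NUMBER_TABLE[state].get(word)
        if state is None:       # empty cell: reject
            return False
    return state in NUMBER_ACCEPT


print(accepts_number("thirty-five"))   # True
print(accepts_number("ninety"))        # True
print(accepts_number("twenty ten"))    # False
```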

We could now add cents and dollars to our automaton. Figure 2.17
shows a simple version of this, where we just made two copies of the
automaton in Figure 2.16 and appended the words cents and dollars.


[Figure: two copies of the number automaton from Figure 2.16, one followed
by an arc for cents and the other by an arc for dollars.]


Figure  2.17  FSA  for  the  simple  dollars  and  cents. 


We would now need to add in the grammar for different amounts of
dollars, including higher numbers like hundred and thousand. We’d also need to
make sure that nouns like cents and dollars are singular when appropriate
(one cent, one dollar), and plural when appropriate (ten cents, two dollars).
This is left as an exercise for the reader (Exercise 2.3). We can think of the
FSAs in Figure 2.16 and Figure 2.17 as simple grammars of parts of English.
We will return to grammar-building in Part II of this book, particularly in
Chapter 9.

Nondeterministic  FSAs 

Let’s extend our discussion now to another class of FSAs: non-deterministic
FSAs (or NFSAs). Consider the sheeptalk automaton in Figure 2.18, which
is much like our first automaton in Figure 2.10:

Figure 2.18 A non-deterministic finite-state automaton for talking sheep
(NFSA #1). Compare with the deterministic automaton in Figure 2.10.

The only difference between this automaton and the previous one is
that here in Figure 2.18 the self-loop is on state 2 instead of state 3.
Consider using this network as an automaton for recognizing sheeptalk. When
we get to state 2, if we see an a we don’t know whether to remain in state
2 or go on to state 3. Automata with decision points like this are called
non-deterministic FSAs (or NFSAs). Recall by contrast that Figure 2.10
specified a deterministic automaton (a DFSA), i.e. one whose behavior during
recognition is fully determined by the state it is in and the symbol it is
looking at. That is not true for the machine in Figure 2.18 (NFSA #1).

There is another common type of non-determinism, which can be caused
by arcs that have no symbols on them (called ε-transitions). The automaton
in Figure 2.19 defines the exact same language as the last one, or our first
one, but it does it with an ε-transition.


We interpret this new arc as follows: if we are in state 3, we are
allowed to move to state 2 without looking at the input, or advancing our input
pointer. So this introduces another kind of non-determinism: we might not
know whether to follow the ε-transition or the ! arc.


Using an NFSA to accept strings

If  we  want  to  know  whether  a string  is  an  instance  of  sheeptalk  or  not,  and 
if  we  use  a non-deterministic  machine  to  recognize  it,  we  might  follow  the 
wrong  arc  and  reject  it  when  we  should  have  accepted  it.  That  is,  since  there 
is  more  than  one  choice  at  some  point,  we  might  take  the  wrong  choice.  This 
problem  of  choice  in  non-deterministic  models  will  come  up  again  and  again 
as  we  build  computational  models,  particularly  for  parsing. 

There are three standard solutions to this problem:

• Backup:  Whenever  we  come  to  a choice  point,  we  could  put  a marker 
to  mark  where  we  were  in  the  input,  and  what  state  the  automaton  was 
in.  Then  if  it  turns  out  that  we  took  the  wrong  choice,  we  could  back 
up  and  try  another  path. 

• Look-ahead:  We  could  look  ahead  in  the  input  to  help  us  decide  which 
path  to  take. 

• Parallelism:  Whenever  we  come  to  a choice  point,  we  could  look  at 
every  alternative  path  in  parallel. 

We  will  focus  here  on  the  backup  approach  and  defer  discussion  of  the 
look-ahead  and  parallelism  approaches  to  later  chapters. 

The backup approach suggests that we should blithely make choices
that might lead to dead ends, knowing that we can always return to
unexplored alternative choices. There are two keys to this approach: we need
to remember all the alternatives for each choice point, and we need to store
sufficient information about each alternative so that we can return to it when
necessary. When a backup algorithm reaches a point in its processing where
no progress can be made (because it runs out of input, or has no legal
transitions), it returns to a previous choice point, selects one of the unexplored
alternatives, and continues from there. Applying this notion to our
non-deterministic recognizer, we need only remember two things for each choice
point: the state, or node, of the machine that we can go to and the
corresponding position on the tape. We will call the combination of the node and
position the search-state of the recognition algorithm. To avoid confusion,
we will refer to the state of the automaton (as opposed to the state of the
search) as a node or a machine-state. Figure 2.21 presents a recognition
algorithm based on this approach.

Before going on to describe the main part of this algorithm, we should
note two changes to the transition table that drives it. First, in order to
represent nodes that have outgoing ε-transitions, we add a new ε-column to the
transition table. If a node has an ε-transition, we list the destination node in
the ε-column for that node’s row. The second addition is needed to account
for multiple transitions to different nodes from the same input symbol. We
let each cell entry consist of a list of destination nodes rather than a single
node. Figure 2.20 shows the transition table for the machine in Figure 2.18
(NFSA #1). While it has no ε-transitions, it does show that in machine-state
q2 the input a can lead back to q2 or on to q3.

             Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4:       ∅    ∅      ∅    ∅

Figure 2.20 The transition table from NFSA #1 in Figure 2.18.
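
One possible encoding of such a table, given here only as a sketch, uses a list of destination nodes in every cell and a reserved key for the ε-column. The key "eps" and the names NFSA1_TABLE and choice_points are assumptions of this example.

```python
EPS = "eps"  # hypothetical label for the ε-column

# Every cell holds a list of destination nodes; [] is an empty cell.
NFSA1_TABLE = {
    0: {"b": [1], "a": [],     "!": [],  EPS: []},
    1: {"b": [],  "a": [2],    "!": [],  EPS: []},
    2: {"b": [],  "a": [2, 3], "!": [],  EPS: []},
    3: {"b": [],  "a": [],     "!": [4], EPS: []},
    4: {"b": [],  "a": [],     "!": [],  EPS: []},
}


def choice_points(table):
    """List the (state, symbol) cells with more than one destination."""
    return [(state, sym)
            for state, row in table.items()
            for sym, dests in row.items()
            if len(dests) > 1]


print(choice_points(NFSA1_TABLE))  # [(2, 'a')]
```

The single choice point found is the one discussed in the text: in machine-state 2 the input a can lead to state 2 or state 3.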

Figure  2.21  shows  the  algorithm  for  using  a non-deterministic  FSA 
to  recognize  an  input  string.  The  function  ND-RECOGNIZE  uses  the  variable 
agenda  to  keep  track  of  all  the  currently  unexplored  choices  generated  during 
the  course  of  processing.  Each  choice  (search  state)  is  a tuple  consisting  of  a 
node  (state)  of  the  machine  and  a position  on  the  tape.  The  variable  current- 
search-state  represents  the  branch  choice  being  currently  explored. 

ND-RECOGNIZE begins by creating an initial search-state and placing
it on the agenda. For now we don’t specify what order the search-states are
placed on the agenda. This search-state consists of the initial machine-state
of the machine and a pointer to the beginning of the tape. The function NEXT
is then called to retrieve an item from the agenda and assign it to the variable
current-search-state.

As with D-RECOGNIZE, the first task of the main loop is to determine
if the entire contents of the tape have been successfully recognized. This
is done via a call to ACCEPT-STATE?, which returns accept if the current
search-state contains both an accepting machine-state and a pointer to the
end of the tape. If we’re not done, the machine generates a set of possible
next steps by calling GENERATE-NEW-STATES, which creates search-states
for any ε-transitions and any normal input-symbol transitions from the
transition table. All of these search-state tuples are then added to the current
agenda.

Finally, we attempt to get a new search-state to process from the agenda.
If the agenda is empty we’ve run out of options and have to reject the input.
Otherwise, an unexplored option is selected and the loop continues.

It is important to understand why ND-RECOGNIZE returns a value of
reject only when the agenda is found to be empty. Unlike D-RECOGNIZE, it
does not return reject when it reaches the end of the tape in a non-accept
machine-state or when it finds itself unable to advance the tape from some
machine-state. This is because, in the non-deterministic case, such
roadblocks only indicate failure down a given path, not overall failure. We can
only be sure we can reject a string when all possible choices have been
examined and found lacking.

Figure 2.22 illustrates the progress of ND-RECOGNIZE as it attempts to
handle the input baaa!. Each strip illustrates the state of the algorithm at
a given point in its processing. The current-search-state variable is captured
by the solid bubbles representing the machine-state along with the arrow
representing progress on the tape. Each strip lower down in the figure represents
progress from one current-search-state to the next.

Little of interest happens until the algorithm finds itself in state q2
while looking at the second a on the tape. An examination of the entry
for transition-table[q2,a] returns both q2 and q3. Search states are created
for each of these choices and placed on the agenda. Unfortunately, our
algorithm chooses to move to state q3, a move that results in neither an accept
state nor any new states since the entry for transition-table[q3,a] is empty.
At this point, the algorithm simply asks the agenda for a new state to pursue.
Since the choice of returning to q2 from q2 is the only unexamined choice on
the agenda it is returned with the tape pointer advanced to the next a.
Somewhat diabolically, ND-RECOGNIZE finds itself faced with the same choice.
The entry for transition-table[q2,a] still indicates that looping back to q2 or
advancing to q3 are valid choices. As before, states representing both are
placed on the agenda. These search states are not the same as the previous
ones since their tape index values have advanced. This time the agenda
provides the move to q3 as the next move. The move to q4, and success, is then
uniquely determined by the tape and the transition-table.

Recognition  as  Search 

ND-RECOGNIZE accomplishes the task of recognizing strings in a regular
language by providing a way to systematically explore all the possible paths
through a machine. If this exploration yields a path ending in an accept
state, it accepts the string, otherwise it rejects it. This systematic exploration
is made possible by the agenda mechanism, which on each iteration selects a
partial path to explore and keeps track of any remaining, as yet unexplored,
partial paths.

function ND-RECOGNIZE(tape, machine) returns accept or reject

  agenda ← {(Initial state of machine, beginning of tape)}
  current-search-state ← NEXT(agenda)
  loop
    if ACCEPT-STATE?(current-search-state) returns true then
      return accept
    else
      agenda ← agenda ∪ GENERATE-NEW-STATES(current-search-state)
    if agenda is empty then
      return reject
    else
      current-search-state ← NEXT(agenda)
  end

function GENERATE-NEW-STATES(current-state) returns a set of search-states

  current-node ← the node the current search-state is in
  index ← the point on the tape the current search-state is looking at
  return a list of search states from transition table as follows:
    (transition-table[current-node, ε], index)
    ∪
    (transition-table[current-node, tape[index]], index + 1)

function ACCEPT-STATE?(search-state) returns true or false

  current-node ← the node search-state is in
  index ← the point on the tape search-state is looking at
  if index is at the end of the tape and current-node is an accept state of machine then
    return true
  else
    return false

Figure 2.21 An algorithm for NFSA recognition. The word node means
a state of the FSA, while state or search-state means ‘the state of the search
process’, i.e. a combination of node and tape-position.
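
A minimal Python sketch of ND-RECOGNIZE is given below. It assumes a dictionary table whose cells are lists of destination nodes, with the hypothetical key "eps" for ε-transitions, and it happens to use a Python list as the agenda with NEXT taking the most recently added search-state. The loop is arranged slightly differently from the pseudocode but rejects under the same condition, namely an empty agenda.

```python
EPS = "eps"  # hypothetical key for the ε-column

NFSA1_TABLE = {
    0: {"b": [1]},
    1: {"a": [2]},
    2: {"a": [2, 3]},
    3: {"!": [4]},
    4: {},
}
NFSA1_ACCEPT = {4}


def generate_new_states(search_state, tape, table):
    """Search-states reachable by one ε-move or one input-symbol move."""
    node, index = search_state
    row = table[node]
    new_states = [(dest, index) for dest in row.get(EPS, [])]
    if index < len(tape):
        new_states += [(dest, index + 1) for dest in row.get(tape[index], [])]
    return new_states


def is_accept_state(search_state, tape, accept_states):
    node, index = search_state
    return index == len(tape) and node in accept_states


def nd_recognize(tape, table, accept_states, start=0):
    agenda = [(start, 0)]
    while agenda:                        # reject only when the agenda is empty
        current = agenda.pop()           # NEXT: here, most recent first
        if is_accept_state(current, tape, accept_states):
            return True
        agenda.extend(generate_new_states(current, tape, table))
    return False


print(nd_recognize("baaa!", NFSA1_TABLE, NFSA1_ACCEPT))  # True
print(nd_recognize("ba!", NFSA1_TABLE, NFSA1_ACCEPT))    # False
```

This sketch keeps no record of search-states already tried, so a machine with a cycle of ε-transitions could keep it running forever.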

Algorithms such as ND-RECOGNIZE, which operate by systematically
searching for solutions, are known as state-space search algorithms. In
such algorithms, the problem definition creates a space of possible
solutions; the goal is to explore this space, returning an answer when one is
found or rejecting the input when the space has been exhaustively explored.
In ND-RECOGNIZE, search states consist of pairings of machine-states with
positions on the input tape. The state-space consists of all the pairings of
machine-state and tape positions that are possible given the machine in
question. The goal of the search is to navigate through this space from one state to
another looking for a pairing of an accept state with an end of tape position.

The key to the effectiveness of such programs is often the order in
which the states in the space are considered. A poor ordering of states may
lead to the examination of a large number of unfruitful states before a
successful solution is discovered. Unfortunately, it is typically not possible to
tell a good choice from a bad one, and often the best we can do is to ensure
that each possible solution is eventually considered.


Careful readers may have noticed that the ordering of states in
ND-RECOGNIZE has been left unspecified. We know only that unexplored states
are added to the agenda as they are created and that the (undefined)
function NEXT returns an unexplored state from the agenda when asked. How
should the function NEXT be defined? Consider an ordering strategy where
the states that are considered next are the most recently created ones. Such
a policy can be implemented by placing newly created states at the front
of the agenda and having NEXT return the state at the front of the agenda
when called. Thus the agenda is implemented by a stack. This is commonly
referred to as a depth-first search or Last In First Out (LIFO) strategy.

Such a strategy dives into the search space following newly developed
leads as they are generated. It will only return to consider earlier options
when progress along a current lead has been blocked. The trace of the
execution of ND-RECOGNIZE on the string baaa! as shown in Figure 2.22
illustrates a depth-first search. The algorithm hits the first choice point after
seeing ba when it has to decide whether to stay in q2 or advance to state
q3. At this point, it chooses one alternative and follows it until it is sure it’s
wrong. The algorithm then backs up and tries another older alternative.

Depth-first strategies have one major pitfall: under certain
circumstances they can enter an infinite loop. This is possible either if the search
space happens to be set up in such a way that a search-state can be
accidentally re-visited, or if there are an infinite number of search states. We will
revisit this question when we turn to more complicated search problems in
parsing in Chapter 10.

The second way to order the states in the search space is to consider
states in the order in which they are created. Such a policy can be
implemented by placing newly created states at the back of the agenda and still
having NEXT return the state at the front of the agenda. Thus the agenda is
implemented via a queue. This is commonly referred to as a breadth-first
search or First In First Out (FIFO) strategy. Consider a different trace
of the execution of ND-RECOGNIZE on the string baaa! as shown in
Figure 2.23. Again, the algorithm hits its first choice point after seeing ba when
it has to decide whether to stay in q2 or advance to state q3. But now rather
than picking one choice and following it up, we imagine examining all
possible choices, expanding one ply of the search tree at a time.
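
The two agenda disciplines can be compared with a short sketch. It assumes the NFSA #1 table (which has no ε-transitions, so ε-moves are omitted), uses Python’s collections.deque as the agenda, and records the order in which NEXT hands out search-states. The function name search_order and the strategy labels are ours.

```python
from collections import deque

NFSA1_TABLE = {0: {"b": [1]}, 1: {"a": [2]}, 2: {"a": [2, 3]}, 3: {"!": [4]}, 4: {}}
NFSA1_ACCEPT = {4}


def search_order(tape, table, accept_states, strategy, start=0):
    """Return (accepted, visited), where visited lists the search-states
    in the order in which NEXT returned them."""
    agenda = deque([(start, 0)])
    visited = []
    while agenda:
        node, index = agenda.popleft()      # NEXT always takes the front
        visited.append((node, index))
        if index == len(tape) and node in accept_states:
            return True, visited
        new_states = []
        if index < len(tape):
            new_states = [(d, index + 1) for d in table[node].get(tape[index], [])]
        if strategy == "depth-first":
            agenda.extendleft(new_states)   # new states go to the front (stack)
        else:
            agenda.extend(new_states)       # new states go to the back (queue)
    return False, visited


print(search_order("baaa!", NFSA1_TABLE, NFSA1_ACCEPT, "depth-first")[1])
# [(0, 0), (1, 1), (2, 2), (3, 3), (2, 3), (3, 4), (4, 5)]
print(search_order("baaa!", NFSA1_TABLE, NFSA1_ACCEPT, "breadth-first")[1])
# [(0, 0), (1, 1), (2, 2), (2, 3), (3, 3), (2, 4), (3, 4), (4, 5)]
```

In the depth-first run the premature move to state 3 is tried and abandoned before the loop on state 2 is explored; in the breadth-first run both alternatives at each tape position are examined before the search moves on.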

Like depth-first search, breadth-first search has its pitfalls. As with
depth-first, if the state-space is infinite, the search may never terminate. More
importantly, due to growth in the size of the agenda if the state-space is
even moderately large, the search may require an impractically large amount
of memory. For small problems, either depth-first or breadth-first search
strategies may be adequate, although depth-first is normally preferred for its
more efficient use of memory. For larger problems, more complex search
techniques such as dynamic programming or A* must be used, as we will
see in Chapter 7 and Chapter 10.

Relating  Deterministic  and  Non-deterministic  Automata 

It may seem that allowing NFSAs to have non-deterministic features like
ε-transitions would make them more powerful than DFSAs. In fact this is not
the case; for any NFSA, there is an exactly equivalent DFSA. In fact there is
a simple algorithm for converting an NFSA to an equivalent DFSA, although
the number of states in this equivalent deterministic automaton may be much
larger. See Lewis and Papadimitriou (1981) or Hopcroft and Ullman (1979)
for the proof of the correspondence. The basic intuition of the proof is worth
mentioning, however, and builds on the way NFSAs parse their input. Recall
that the difference between NFSAs and DFSAs is that in an NFSA a state qi
may have more than one possible next state given an input i (for example qa
and qb). The algorithm in Figure 2.21 dealt with this problem by choosing
either qa or qb, and then backtracking if the choice turned out to be wrong.
We mentioned that a parallel version of the algorithm would follow both
paths (toward qa and qb) simultaneously.

The algorithm for converting an NFSA to a DFSA is like this parallel
algorithm; we build an automaton that has a deterministic path for every path
our parallel recognizer might have followed in the search space. We imagine
following both paths simultaneously, and group together into an equivalence
class all the states we reach on the same input symbol (i.e. qa and qb). We
now give a new state label to this new equivalence class state (for example
qab). We continue doing this for every possible input for every possible group
of states. The resulting DFSA can have as many states as there are distinct
sets of states in the original NFSA. The number of different subsets of a set
with N elements is 2^N, hence the new DFSA can have as many as 2^N states.
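
The grouping of states described above can be sketched as follows. The sketch is a simplification: it assumes an NFSA without ε-transitions, uses Python frozensets as the labels of the grouped states, and builds only the groups that are actually reachable. The function name nfsa_to_dfsa is ours.

```python
def nfsa_to_dfsa(table, accept_states, alphabet, start=0):
    """Group NFSA states into sets; each reachable set is one DFSA state."""
    start_set = frozenset([start])
    dfsa = {}
    todo = [start_set]
    while todo:
        group = todo.pop()
        if group in dfsa:
            continue
        dfsa[group] = {}
        for sym in alphabet:
            # All nodes reachable from any member of the group on this symbol.
            dest = frozenset(d for node in group
                             for d in table[node].get(sym, []))
            if dest:                         # leave empty cells empty
                dfsa[group][sym] = dest
                todo.append(dest)
    dfsa_accept = {g for g in dfsa if g & accept_states}
    return dfsa, start_set, dfsa_accept


NFSA1_TABLE = {0: {"b": [1]}, 1: {"a": [2]}, 2: {"a": [2, 3]}, 3: {"!": [4]}, 4: {}}
dfsa, dfsa_start, dfsa_accept = nfsa_to_dfsa(NFSA1_TABLE, {4}, "ab!")

print(len(dfsa))                  # 5
print(dfsa[frozenset({2, 3})])    # a loops back to {2, 3}; ! leads to {4}
```

For NFSA #1 only five of the 2^5 possible groups are reachable, and the group {2, 3} behaves like state 3 of the deterministic sheeptalk automaton.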


2.3  Regular  Languages  and  FSAs 

As we suggested above, the class of languages that are definable by regular
expressions is exactly the same as the class of languages that are
characterizable by finite-state automata (whether deterministic or non-deterministic).
Because of this, we call these languages the regular languages. In order to
give a formal definition of the class of regular languages, we need to refer
back to two earlier concepts: the alphabet Σ, which is the set of all symbols in
the language, and the empty string ε, which is conventionally not included in
Σ. In addition, we make reference to the empty set ∅ (which is distinct from
ε). The class of regular languages (or regular sets) over Σ is then formally
defined as follows:¹

1. ∅ is a regular language
2. ∀a ∈ Σ ∪ ε, {a} is a regular language
3. If L1 and L2 are regular languages, then so are:
   (a) L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
   (b) L1 ∪ L2, the union or disjunction of L1 and L2
   (c) L1*, the Kleene closure of L1

¹ Following van Santen and Sproat (1998), Kaplan and Kay (1994), and Lewis and
Papadimitriou (1981).

All and only the sets of languages which meet the above properties
are regular languages. Since the regular languages are the set of languages
characterizable by regular expressions, it must be the case that all the
regular-expression operators introduced in this chapter (except memory) can be
implemented by the three operations which define regular languages:
concatenation, disjunction/union (also called “|”), and Kleene closure. For
example all the counters (*, +, {n,m}) are just a special case of repetition plus
Kleene *. All the anchors can be thought of as individual special symbols.
The square braces [] are a kind of disjunction (i.e. [ab] means “a or b”, or
the disjunction of a and b). Thus it is true that any regular expression can be
turned into a (perhaps larger) expression which only makes use of the three
primitive operations.
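
A few such rewritings can be spot-checked with Python’s re module. The check below compares each pattern with a version that uses only concatenation, | and *, on a small hand-picked sample of strings; it is an illustration rather than a proof of equivalence, and the helper name same_on is ours.

```python
import re


def same_on(samples, pattern_a, pattern_b):
    """True if the two patterns accept exactly the same strings in `samples`."""
    a, b = re.compile(pattern_a), re.compile(pattern_b)
    return all(bool(a.fullmatch(s)) == bool(b.fullmatch(s)) for s in samples)


samples = ["", "b!", "ba!", "baa!", "baaa!", "baaaa!", "bab!", "aa"]

# a{2,3} as a union of concatenations
print(same_on(samples, r"ba{2,3}!", r"b(aa|aaa)!"))      # True
# a+ as one a followed by a*
print(same_on(samples, r"baa+!", r"baaa*!"))             # True
# [ab] as the disjunction a|b
print(same_on(samples, r"b[ab]+!", r"b(a|b)(a|b)*!"))    # True
```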

Regular languages are also closed under the following operations (where
Σ* means the infinite set of all possible strings formed from the alphabet Σ):

• intersection: if L1 and L2 are regular languages, then so is L1 ∩ L2, the
language consisting of the set of strings that are in both L1 and L2.

• difference: if L1 and L2 are regular languages, then so is L1 − L2, the
language consisting of the set of strings that are in L1 but not L2.

• complementation: if L1 is a regular language, then so is Σ* − L1, the
set of all possible strings that aren’t in L1.

• reversal: if L1 is a regular language, then so is L1^R, the language
consisting of the set of reversals of all the strings in L1.
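
Closure under intersection can be illustrated with the standard product construction, which the text above does not describe: two deterministic automata are run in lockstep, and a pair of states is accepting only when both members are. The sketch below makes that assumption explicit; the second language (strings with an even number of a’s) and all names are hypothetical.

```python
def intersect(table1, accept1, table2, accept2, alphabet, start1=0, start2=0):
    """Product construction: run two DFSAs in lockstep on the same input."""
    start = (start1, start2)
    table, todo = {}, [start]
    while todo:
        pair = todo.pop()
        if pair in table:
            continue
        table[pair] = {}
        p, q = pair
        for sym in alphabet:
            # A move exists only if both machines can make it.
            if sym in table1[p] and sym in table2[q]:
                dest = (table1[p][sym], table2[q][sym])
                table[pair][sym] = dest
                todo.append(dest)
    accept = {(p, q) for (p, q) in table if p in accept1 and q in accept2}
    return table, start, accept


def d_accepts(tape, table, accept, start):
    state = start
    for sym in tape:
        if sym not in table[state]:
            return False
        state = table[state][sym]
    return state in accept


SHEEP = {0: {"b": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3, "!": 4}, 4: {}}
# Hypothetical second language: strings over {a, b, !} with an even number of a's.
EVEN_A = {0: {"a": 1, "b": 0, "!": 0}, 1: {"a": 0, "b": 1, "!": 1}}

both, start, accept = intersect(SHEEP, {4}, EVEN_A, {0}, "ab!")
print(d_accepts("baa!", both, accept, start))    # True
print(d_accepts("baaa!", both, accept, start))   # False
print(d_accepts("baaaa!", both, accept, start))  # True
```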

The proof that regular expressions are equivalent to finite-state
automata can be found in Hopcroft and Ullman (1979), and has two parts:
showing that an automaton can be built for each regular language, and
conversely that a regular language can be built for each automaton. We won’t
give the proof, but we give the intuition by showing how to do the first part:
take any regular expression and build an automaton from it. The intuition is
inductive: for the base case we build an automaton to correspond to regular
expressions of a single symbol (e.g. the expression a) by creating an initial
state and an accepting final state, with an arc between them labeled a. For
the inductive step, we show that each of the primitive operations of a regular
expression (concatenation, union, closure) can be imitated by an automaton:

• concatenation: We just string two FSAs next to each other by
connecting all the final states of FSA1 to the initial state of FSA2 by an
ε-transition.

• closure: We connect all the final states of the FSA back to the initial
states by ε-transitions (this implements the repetition part of the Kleene
*), and then put direct links between the initial and final states by
ε-transitions (this implements the possibility of having zero occurrences).
We’d leave out this last part to implement Kleene-plus instead.

• union: We add a single new initial state q'0, and add new transitions
from it to all the former initial states of the two machines to be joined.

Figure 2.25 The closure (Kleene *) of an FSA.

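The three constructions above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class NFA, the helper names, and the use of None to stand for ε are all hypothetical choices, not notation from the text. One deliberate deviation: for closure the sketch wraps the machine in a fresh initial state and a fresh final state (the usual precaution in Thompson-style constructions), because linking the old initial state directly to the old final states can accept extra strings when the initial state already has incoming arcs.

```python
from itertools import count

_fresh = count()  # hypothetical helper: generates unused state names

class NFA:
    """An automaton with ε-arcs: arcs maps (state, label) -> set of states,
    and the label None stands for ε."""
    def __init__(self, start, finals, arcs):
        self.start, self.finals, self.arcs = start, set(finals), arcs

def _copy_arcs(*machines):
    arcs = {}
    for m in machines:
        for key, targets in m.arcs.items():
            arcs.setdefault(key, set()).update(targets)
    return arcs

def _link(arcs, src, dst, label=None):
    arcs.setdefault((src, label), set()).add(dst)

def symbol(a):
    # Base case: an initial state, an accepting state, and one arc labeled a.
    q0, q1 = next(_fresh), next(_fresh)
    return NFA(q0, {q1}, {(q0, a): {q1}})

def concat(m1, m2):
    # Connect every final state of m1 to the initial state of m2 by ε.
    arcs = _copy_arcs(m1, m2)
    for f in m1.finals:
        _link(arcs, f, m2.start)
    return NFA(m1.start, m2.finals, arcs)

def union(m1, m2):
    # A single new initial state with ε-arcs to both former initial states.
    q0 = next(_fresh)
    arcs = _copy_arcs(m1, m2)
    _link(arcs, q0, m1.start)
    _link(arcs, q0, m2.start)
    return NFA(q0, m1.finals | m2.finals, arcs)

def closure(m, allow_zero=True):
    # Fresh initial and final states wrap the machine (our assumption).
    q0, qf = next(_fresh), next(_fresh)
    arcs = _copy_arcs(m)
    _link(arcs, q0, m.start)
    for f in m.finals:
        _link(arcs, f, m.start)   # repetition part of the Kleene *
        _link(arcs, f, qf)
    if allow_zero:
        _link(arcs, q0, qf)       # zero occurrences; omit for Kleene-plus
    return NFA(q0, {qf}, arcs)

def accepts(m, string):
    def eps_closure(states):
        seen, stack = set(states), list(states)
        while stack:
            for t in m.arcs.get((stack.pop(), None), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen
    current = eps_closure({m.start})
    for ch in string:
        step = set()
        for s in current:
            step |= m.arcs.get((s, ch), set())
        current = eps_closure(step)
    return bool(current & m.finals)

# (ab|c)* built bottom-up from single-symbol machines
m = closure(union(concat(symbol("a"), symbol("b")), symbol("c")))
print(accepts(m, "abcab"), accepts(m, ""), accepts(m, "ba"))
```

Calling closure with allow_zero=False leaves out the zero-occurrence link and so gives Kleene-plus, as the closure bullet suggests.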

2.4  Summary 

This chapter introduced the most important fundamental concept in language
processing, the finite automaton, and the practical tool based on it,
the regular expression. Here’s a summary of the main points we covered
about these ideas:

• the regular expression language is a powerful tool for pattern-matching.

• basic operations in regular expressions include concatenation of
symbols, disjunction of symbols ([], |, and .), counters (*, +, and
{n,m}), anchors (^, $) and precedence operators ((, )).

• any  regular  expression  can  be  realized  as  a finite  automaton. 

• memory (\1 together with ()) is an advanced operation which is often
considered part of regular expressions, but which cannot be realized as
a finite automaton.

• an  automaton  implicitly  defines  a formal  language  as  the  set  of  strings 
the  automaton  accepts. 

• an  automaton  can  use  any  set  of  symbols  for  its  vocabulary,  including 
letters,  words,  or  even  graphic  images. 

• the behavior of a deterministic automaton (DFSA) is fully determined
by the state it is in.

• a non-deterministic automaton (NFSA) sometimes has to make a choice
between multiple paths to take given the same current state and next
input.

• any  NFSA  can  be  converted  to  a DFSA. 

• the  order  in  which  a NFSA  chooses  the  next  state  to  explore  on  the 
agenda  defines  its  search  strategy.  The  depth-first  search  or  LIFO 
strategy  corresponds  to  the  agenda-as-stack;  the  breadth-first  search 
or  FIFO  strategy  corresponds  to  the  agenda-as-queue. 

• any  regular  expression  can  be  automatically  compiled  into  a NFSA  and 
hence  into  a FSA. 
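The point about search strategies can be illustrated with a small agenda-based recognizer. The sketch below is our own and rests on assumptions: the function name nd_recognize, the dictionary encoding of transitions, and the example machine SHEEP for the pattern baa+! are all hypothetical. The only difference between the two strategies is which end of the agenda the next search state is taken from.

```python
from collections import deque

def nd_recognize(transitions, start, finals, tape, strategy="depth-first"):
    """transitions maps (state, symbol) -> list of next states;
    the symbol None stands for an ε-transition."""
    agenda = deque([(start, 0)])      # search states: (machine state, tape index)
    seen = set()                      # avoids re-exploring, e.g. around ε-loops
    while agenda:
        if strategy == "depth-first":
            state, i = agenda.pop()       # LIFO: agenda used as a stack
        else:
            state, i = agenda.popleft()   # FIFO: agenda used as a queue
        if (state, i) in seen:
            continue
        seen.add((state, i))
        if i == len(tape) and state in finals:
            return True
        for nxt in transitions.get((state, None), []):
            agenda.append((nxt, i))
        if i < len(tape):
            for nxt in transitions.get((state, tape[i]), []):
                agenda.append((nxt, i + 1))
    return False

# Hypothetical NFSA for baa+!: in q2, the symbol a offers two choices.
SHEEP = {
    ("q0", "b"): ["q1"],
    ("q1", "a"): ["q2"],
    ("q2", "a"): ["q2", "q3"],
    ("q3", "!"): ["q4"],
}
print(nd_recognize(SHEEP, "q0", {"q4"}, "baaa!"))
print(nd_recognize(SHEEP, "q0", {"q4"}, "baaa!", strategy="breadth-first"))
```

Both strategies give the same answer on any input; they differ only in the order in which alternative paths are explored.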


Bibliographical  and  Historical  Notes 

Finite automata arose in the 1950s out of Turing’s (1936) model of
algorithmic computation, considered by many to be the foundation of modern
computer science. The Turing machine was an abstract machine with a finite
control and an input/output tape. In one move, the Turing machine could
read a symbol on the tape, write a different symbol on the tape, change state,
and move left or right. (Thus the Turing machine differs from a finite-state
automaton mainly in its ability to change the symbols on its tape.)

Inspired by Turing’s work, McCulloch and Pitts built an automata-like
model of the neuron (see von Neumann, 1963, p. 319). Their model, which
is now usually called the McCulloch-Pitts neuron (McCulloch and Pitts,
1943), was a simplified model of the neuron as a kind of ‘computing
element’ that could be described in terms of propositional logic. The model
was a binary device, at any point either active or not, which took excitatory
and inhibitory input from other neurons and fired if its activation passed
some fixed threshold. Based on the McCulloch-Pitts neuron, Kleene (1951,
1956) defined the finite automaton and regular expressions, and proved
their equivalence. Non-deterministic automata were introduced by Rabin
and Scott (1959), who also proved them equivalent to deterministic ones.

Ken Thompson was one of the first to build regular expression
compilers into editors for text searching (Thompson, 1968). His editor ed
included a command “g/regular expression/p”, or Global Regular Expression
Print, which later became the UNIX grep utility.

There are many general-purpose introductions to the mathematics
underlying automata theory, such as Hopcroft and Ullman (1979) and Lewis
and Papadimitriou (1981). These cover the mathematical foundations of the
simple automata of this chapter, as well as the finite-state transducers of
Chapter 3, the context-free grammars of Chapter 9, and the Chomsky
hierarchy of Chapter 13. Friedl (1997) is a very useful comprehensive guide to
the advanced use of regular expressions.

The metaphor of problem-solving as search is basic to Artificial
Intelligence (AI); more details on search can be found in any AI textbook
such as Russell and Norvig (1995).


Exercises 

2.1 Write regular expressions for the following languages. You may use
either Perl notation or the minimal ‘algebraic’ notation of Section 2.3, but
make sure to say which one you are using. By ‘word’, we mean an alphabetic
string separated from other words by white space, any relevant punctuation,
line breaks, etc.

a.  the  set  of  all  alphabetic  strings. 

b.  the  set  of  all  lowercase  alphabetic  strings  ending  in  a b. 

c. the set of all strings with two consecutive repeated words (for example
‘Humbert Humbert’ and ‘the the’ but not ‘the bug’ or ‘the big bug’).

d. the set of all strings from the alphabet a, b such that each a is
immediately preceded and immediately followed by a b.

e.  all  strings  which  start  at  the  beginning  of  the  line  with  an  integer  (i.e. 
1,2,3...  10...  10000...)  and  which  end  at  the  end  of  the  line  with  a word. 

f. all strings which have both the word grotto and the word raven in them
(but not, for example, words like grottos that merely contain the word
grotto).

g.  write  a pattern  which  places  the  first  word  of  an  English  sentence  in  a 
register.  Deal  with  punctuation. 

2.2  Implement  an  ELIZA-like  program,  using  substitutions  such  as  those 
described  on  page  32.  You  may  choose  a different  domain  than  a Rogerian 
psychologist,  if  you  wish,  although  keep  in  mind  that  you  would  need  a 
domain  in  which  your  program  can  legitimately  do  a lot  of  simple  repeating- 
back. 


2.3  Complete  the  FSA  for  English  money  expressions  in  Figure  2.16  as 
suggested  in  the  text  following  the  figure.  You  should  handle  amounts  up 
to  $100,000,  and  make  sure  that  “cent”  and  “dollar”  have  the  proper  plural 
endings  when  appropriate. 

2.4 Design an FSA that recognizes simple date expressions like March 15,
the 22nd of November, Christmas. You should try to include all such
‘absolute’ dates (e.g. not ‘deictic’ ones relative to the current day like the day
before yesterday). Each edge of the graph should have a word or a set of
words on it. You should use some sort of shorthand for classes of words to
avoid drawing too many arcs (e.g. Furniture → desk, chair, table).

2.5  Now  extend  your  date  FSA  to  handle  deictic  expressions  like  yesterday, 
tomorrow,  a week  from  tomorrow,  the  day  before  yesterday,  Sunday,  next 
Monday,  three  weeks  from  Saturday. 

2.6 Write an FSA for time-of-day expressions like eleven o’clock,
twelve-thirty, midnight, or a quarter to ten and others.

2.7  Write  a regular  expression  for  the  language  accepted  by  the  NFSA  in 
Figure  2.27 

2.8 Currently the function D-RECOGNIZE in Figure 2.13 only solves a
subpart of the important problem of finding a string in some text. Extend the
algorithm to solve the following two deficiencies: (1) D-RECOGNIZE
currently assumes that it is already pointing at the string to be checked. (2)
D-RECOGNIZE fails if the string it is pointing at includes as a proper substring
a legal string for the FSA. That is, D-RECOGNIZE fails if there is an extra
character at the end of the string.

2.9  Give  an  algorithm  for  negating  a deterministic  FSA.  The  negation  of  an 
FSA  accepts  exactly  the  set  of  strings  that  the  original  FSA  rejects  (over  the 
same  alphabet),  and  rejects  all  the  strings  that  the  original  FSA  accepts. 

2.10 Why doesn’t your previous algorithm work with NFSAs? Now extend
your algorithm to negate an NFSA.


MORPHOLOGY AND FINITE-STATE TRANSDUCERS


A writer is someone who writes, and a stinger is something that
stings. But fingers don’t fing, grocers don’t groce, haberdashers
don’t haberdash, hammers don’t ham, and humdingers don’t
humding.

Richard  Lederer,  Crazy  English 


Chapter 2 introduced the regular expression, showing for example how
a single search string could help a web search engine find both woodchuck
and woodchucks. Hunting for singular or plural woodchucks was easy; the
plural just tacks an s on to the end. But suppose we were looking for other
fascinating woodland creatures; let’s say a fox, and a fish, that surly peccary
and perhaps a Canadian wild goose. Hunting for the plurals of these animals
takes more than just tacking on an s. The plural of fox is foxes; of peccary,
peccaries; and of goose, geese. To confuse matters further, fish don’t usually
change their form when they are plural (as Dr. Seuss points out: one fish two
fish, red fish, blue fish).

It takes two kinds of knowledge to correctly search for singulars and
plurals of these forms. Spelling rules tell us that English words ending in -y
are pluralized by changing the -y to -i- and adding an -es. Morphological
rules tell us that fish has a null plural, and that the plural of goose is formed
by changing the vowel.

The problem of recognizing that foxes breaks down into the two
morphemes fox and -es is called morphological parsing.

Key Concept #2. Parsing means taking an input and producing some
sort of structure for it.


We will use the term parsing very broadly throughout this book, including
many kinds of structures that might be produced: morphological, syntactic,
semantic, pragmatic; in the form of a string, or a tree, or a network. In
the information retrieval domain, the similar (but not identical) problem of
mapping from foxes to fox is called stemming. Morphological parsing or
stemming applies to many affixes other than plurals; for example we might
need to take any English verb form ending in -ing (going, talking,
congratulating) and parse it into its verbal stem plus the -ing morpheme. So given
the surface or input form going, we might want to produce the parsed form
VERB-go + GERUND-ing. This chapter will survey the kinds of
morphological knowledge that need to be represented in different languages and
introduce the main component of an important algorithm for morphological
parsing: the finite-state transducer.

Why don’t we just list all the plural forms of English nouns, and all the
-ing forms of English verbs in the dictionary? The major reason is that -ing
is a productive suffix; by this we mean that it applies to every verb.
Similarly -s applies to almost every noun. So the idea of listing every noun and
verb can be quite inefficient. Furthermore, productive suffixes even apply to
new words (so the new word fax automatically can be used in the -ing form:
faxing). Since new words (particularly acronyms and proper nouns) are
created every day, the class of nouns in English increases constantly, and we
need to be able to add the plural morpheme -s to each of these. Additionally,
the plural form of these new nouns depends on the spelling/pronunciation of
the singular form; for example if the noun ends in -z then the plural form is
-es rather than -s. We’ll need to encode these rules somewhere. Finally, we
certainly cannot list all the morphological variants of every word in
morphologically complex languages like Turkish, which has words like the
following:

(3.1) uygarlaştıramadıklarımızdanmışsınızcasına

      uygar      +laş  +tır   +ama      +dık    +lar  +ımız
      civilized  +BEC  +CAUS  +NEGABLE  +PPART  +PL   +P1PL

      +dan  +mış   +sınız  +casına
      +ABL  +PAST  +2PL    +AsIf

‘(behaving) as if you are among those whom we could not
civilize/cause to become civilized’

The various pieces of this word (the morphemes) have these meanings:

+BEC is ‘become’ in English

+CAUS is the causative voice marker on a verb

+NEGABLE is ‘not able’ in English

+PPART marks a past participle form

+P1PL is 1st person plural possessive agreement

+2PL is 2nd person plural

+ABL is the ablative (from/among) case marker

+AsIf is a derivational marker that forms an adverb from a finite verb form

In such languages we clearly need to parse the input since it is
impossible to store every possible word. Kemal Oflazer (p.c.), who came up with
this example, notes that verbs in Turkish have 40,000 forms not counting
derivational suffixes; adding derivational suffixes allows a theoretically
infinite number of words. This is true because for example any verb can be
‘causativized’ like the example above, and multiple instances of
causativization can be embedded in a single word (you cause X to cause Y to ... do W).
Not all Turkish words look like this; Oflazer finds that the average Turkish
word has about three morphemes (a root plus two suffixes). Even so, the
fact that such words are possible means that it will be difficult to store all
possible Turkish words in advance.

Morphological parsing is necessary for more than just information
retrieval. We will need it in machine translation to realize that the French
words va and aller should both translate to forms of the English verb go.
We will also need it in spell checking; as we will see, it is morphological
knowledge that will tell us that misclam and antiundoggingly are not words.

The  next  sections  will  summarize  morphological  facts  about  English 
and  then  introduce  the  finite-state  transducer. 

3.1  Survey  of  (Mostly)  English  Morphology 

Morphology is the study of the way words are built up from smaller
meaning-bearing units, morphemes. A morpheme is often defined as the minimal
meaning-bearing unit in a language. So for example the word fox consists of
a single morpheme (the morpheme fox) while the word cats consists of two:
the morpheme cat and the morpheme -s.

As this example suggests, it is often useful to distinguish two broad
classes of morphemes: stems and affixes. The exact details of the
distinction vary from language to language, but intuitively, the stem is the ‘main’
morpheme of the word, supplying the main meaning, while the affixes add
‘additional’ meanings of various kinds.

Affixes are further divided into prefixes, suffixes, infixes, and
circumfixes. Prefixes precede the stem, suffixes follow the stem, circumfixes do
both, and infixes are inserted inside the stem. For example, the word eats is
composed of a stem eat and the suffix -s. The word unbuckle is composed of
a stem buckle and the prefix un-. English doesn’t have any good examples
of circumfixes, but many other languages do. In German, for example, the
past participle of some verbs is formed by adding ge- to the beginning of the
stem and -t to the end; so the past participle of the verb sagen (to say) is
gesagt (said). Infixes, in which a morpheme is inserted in the middle of a
word, occur very commonly, for example, in the Philippine language Tagalog.
For example the affix um, which marks the agent of an action, is infixed to
the Tagalog stem hingi ‘borrow’ to produce humingi. There is one infix that
occurs in some dialects of English in which the taboo morpheme ‘f**king’
or others like it are inserted in the middle of other words
(‘Man-f**king-hattan’) (McCawley, 1978).

Prefixes and suffixes are often called concatenative morphology since
a word is composed of a number of morphemes concatenated together. A
number of languages have extensive non-concatenative morphology, in
which morphemes are combined in more complex ways. The Tagalog
infixation example above is one example of non-concatenative morphology,
since two morphemes (hingi and um) are intermingled. Another kind of
non-concatenative morphology is called templatic morphology or
root-and-pattern morphology. This is very common in Arabic, Hebrew, and
other Semitic languages. In Hebrew, for example, a verb is constructed
using two components: a root, consisting usually of three consonants (CCC)
and carrying the basic meaning, and a template, which gives the ordering of
consonants and vowels and specifies more semantic information about the
resulting verb, such as the semantic voice (e.g. active, passive, middle). For
example the Hebrew tri-consonantal root lmd, meaning ‘learn’ or ‘study’,
can be combined with the active voice CaCaC template to produce the word
lamad, ‘he studied’, or the intensive CiCeC template to produce the word
limed, ‘he taught’, or the intensive passive template CuCaC to produce the
word lumad, ‘he was taught’.
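The way a root and a template interleave can be made concrete with a tiny sketch. The function name apply_template and the convention that each C in the template is a slot for the next root consonant are our own assumptions for illustration; real templatic systems involve much more than this string substitution.

```python
def apply_template(root, template):
    """Fill each 'C' slot of the template with the next root consonant;
    every other character of the template (the vowels) is copied as is."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in template)

for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, apply_template("lmd", template))
```

With the root lmd, the three templates from the text give lamad, limed, and lumad.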

A word can have more than one affix. For example, the word rewrites
has the prefix re-, the stem write, and the suffix -s. The word unbelievably
has a stem (believe) plus three affixes (un-, -able, and -ly). While English
doesn’t tend to stack more than 4 or 5 affixes, languages like Turkish can
have words with 9 or 10 affixes, as we saw above. Languages that tend to
string affixes together like Turkish does are called agglutinative languages.

There are two broad (and partially overlapping) classes of ways to form
words from morphemes: inflection and derivation. Inflection is the
combination of a word stem with a grammatical morpheme, usually resulting in a
word of the same class as the original stem, and usually filling some
syntactic function like agreement. For example, English has the inflectional
morpheme -s for marking the plural on nouns, and the inflectional morpheme
-ed for marking the past tense on verbs. Derivation is the combination of a
word stem with a grammatical morpheme, usually resulting in a word of a
different class, often with a meaning hard to predict exactly. For example the
verb computerize can take the derivational suffix -ation to produce the noun
computerization.

Inflectional  Morphology 

English has a relatively simple inflectional system; only nouns, verbs, and
sometimes adjectives can be inflected, and the number of possible
inflectional affixes is quite small.

English nouns have only two kinds of inflection: an affix that marks
plural and an affix that marks possessive. For example, many (but not all)
English nouns can either appear in the bare stem or singular form, or take a
plural suffix. Here are examples of the regular plural suffix -s, the alternative
spelling -es, and irregular plurals:


            Regular Nouns        Irregular Nouns
Singular    cat     thrush       mouse    ox
Plural      cats    thrushes     mice     oxen

While the regular plural is spelled -s after most nouns, it is spelled -es
after words ending in -s (ibis/ibises), -z (waltz/waltzes), -sh (thrush/thrushes),
-ch (finch/finches) and sometimes -x (box/boxes). Nouns ending in -y
preceded by a consonant change the -y to -i (butterfly/butterflies).
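These spelling conventions are regular enough to write down directly. The following is a minimal sketch under stated assumptions: the function name regular_plural is hypothetical, it treats -x as always taking -es (the text says only ‘sometimes’), and it knows nothing about irregular nouns such as mouse or ox.

```python
VOWELS = set("aeiou")

def regular_plural(noun):
    """Spell the regular plural of a noun, using only the rules in the text."""
    # -es after -s, -z, -sh, -ch and (by assumption, always) -x
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    # -y preceded by a consonant changes to -i before -es
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    return noun + "s"

for noun in ("cat", "ibis", "waltz", "thrush", "finch", "box", "butterfly"):
    print(noun, regular_plural(noun))
```

A noun like toy keeps its -y because the preceding letter is a vowel, so the sketch returns toys.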

The possessive suffix is realized by apostrophe + -s for regular singular
nouns (llama’s) and plural nouns not ending in -s (children’s) and often by a
lone apostrophe after regular plural nouns (llamas’) and some names ending
in -s or -z (Euripides’ comedies).

English verbal inflection is more complicated than nominal inflection.
First, English has three kinds of verbs: main verbs (eat, sleep, impeach),
modal verbs (can, will, should), and primary verbs (be, have, do) (using
the terms of Quirk et al., 1985a). In this chapter we will mostly be concerned
with the main and primary verbs, because it is these that have inflectional
endings. Of these verbs a large class are regular, that is to say all verbs of
this class have the same endings marking the same functions. These regular
verbs (e.g. walk or inspect) have four morphological forms, as follows:


Morphological Form Classes     Regularly Inflected Verbs
stem                           walk      merge     try      map
-s form                        walks     merges    tries    maps
-ing participle                walking   merging   trying   mapping
Past form or -ed participle    walked    merged    tried    mapped

These verbs are called regular because just by knowing the stem we
can predict the other forms, by adding one of three predictable endings, and
making some regular spelling changes (and as we will see in Chapter 4,
regular pronunciation changes). These regular verbs and forms are significant in
the morphology of English first because they cover a majority of the verbs,
and second because the regular class is productive. As discussed earlier, a
productive class is one that automatically includes any new words that enter
the language. For example the recently-created verb fax (My mom faxed me
the note from cousin Everett) takes the regular endings -ed, -ing, -es. (Note
that the -s form is spelled faxes rather than faxs; we will discuss spelling
rules below.)

The irregular verbs are those that have some more or less
idiosyncratic forms of inflection. Irregular verbs in English often have five different
forms, but can have as many as eight (e.g. the verb be) or as few as three (e.g.
cut or hit). While constituting a much smaller class of verbs (Quirk et al.
(1985a) estimate there are only about 250 irregular verbs, not counting
auxiliaries), this class includes most of the very frequent verbs of the language.1
The table below shows some sample irregular forms. Note that an irregular
verb can inflect in the past form (also called the preterite) by changing its
vowel (eat/ate), or its vowel and some consonants (catch/caught), or with no
ending at all (cut/cut).


1 In general, the more frequent a word form, the more likely it is to have idiosyncratic
properties; this is due to a fact about language change: very frequent words preserve their
form even if other words around them are changing so as to become more regular.

Morphological Form Classes     Irregularly Inflected Verbs
stem                           eat       catch      cut
-s form                        eats      catches    cuts
-ing participle                eating    catching   cutting
Past form                      ate       caught     cut
-ed participle                 eaten     caught     cut

The way these forms are used in a sentence will be discussed in
Chapters 8-12 but is worth a brief mention here. The -s form is used in the
‘habitual present’ form to distinguish the 3rd-person singular ending (She jogs
every Tuesday) from the other choices of person and number (I/you/we/they
jog every Tuesday). The stem form is used in the infinitive form, and also
after certain other verbs (I’d rather walk home, I want to walk home). The
-ing participle is used when the verb is treated as a noun; this particular
kind of nominal use of a verb is called a gerund use: Fishing is fine if you
live near water. The -ed participle is used in the perfect construction (He’s
eaten lunch already) or the passive construction (The verdict was overturned
yesterday).

In addition to noting which suffixes can be attached to which stems,
we need to capture the fact that a number of regular spelling changes occur
at these morpheme boundaries. For example, a single consonant letter is
doubled before adding the -ing and -ed suffixes (beg/begging/begged). If the
final letter is ‘c’, the doubling is spelled ‘ck’ (picnic/picnicking/picnicked).
If the base ends in a silent -e, it is deleted before adding -ing and -ed
(merge/merging/merged). Just as for nouns, the -s ending is spelled -es after verb
stems ending in -s (toss/tosses), -z (waltz/waltzes), -sh (wash/washes), -ch
(catch/catches) and sometimes -x (tax/taxes). Also like nouns, verbs ending
in -y preceded by a consonant change the -y to -i (try/tries).
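The boundary changes for -ing and -ed can be sketched in the same spirit. Everything here is an assumption made for illustration: the function names are hypothetical, a ‘single consonant’ is approximated as a consonant-vowel-consonant ending (excluding final w, x and y), and a final -e after a consonant is taken to be silent. The sketch ignores stress, so it would wrongly double the t of visit, and it handles regular verbs only.

```python
VOWELS = set("aeiou")
NEVER_DOUBLED = VOWELS | set("wxy")

def _adjust_stem(stem):
    """Apply the spelling changes shared by the -ing and -ed suffixes."""
    if stem.endswith("e") and len(stem) > 2 and stem[-2] not in VOWELS:
        return stem[:-1]               # silent -e deletion: merge -> merg
    if stem.endswith("c"):
        return stem + "k"              # doubling of c is spelled ck
    if (len(stem) >= 3 and stem[-1] not in NEVER_DOUBLED
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        return stem + stem[-1]         # consonant doubling: beg -> begg
    return stem

def ing_form(stem):
    return _adjust_stem(stem) + "ing"

def ed_form(stem):
    # -y preceded by a consonant changes to -i, as in try/tried
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"
    return _adjust_stem(stem) + "ed"

for stem in ("walk", "merge", "try", "map", "beg", "picnic"):
    print(stem, ing_form(stem), ed_form(stem))
```

On the regular verbs tabulated earlier (walk, merge, try, map) the sketch reproduces the listed -ing and -ed forms.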

The English verbal system is much simpler than for example the
European Spanish system, which has as many as fifty distinct verb forms for
each regular verb. Figure 3.1 shows just a few of the examples for the verb
amar, ‘to love’. Other languages can have even more forms than this Spanish
example.



Present     Imper.       Imperfect   Future    Preterite  Present   Conditional  Imperfect  Future
Indicative               Indicative                       Subjnct.               Subjnct.   Subjnct.
amo                      amaba       amaré     amé        ame       amaría       amara      amare
amas        ama, ames    amabas      amarás    amaste     ames      amarías      amaras     amares
ama                      amaba       amará     amó        ame       amaría       amara      amare
amamos                   amábamos    amaremos  amamos     amemos    amaríamos    amáramos   amáremos
amáis       amad, amais  amabais     amaréis   amasteis   améis     amaríais     amarais    amareis
aman                     amaban      amarán    amaron     amen      amarían      amaran     amaren

Figure 3.1 To love in Spanish.


Derivational Morphology

While English inflection is relatively simple compared to other languages,
derivation in English is quite complex. Recall that derivation is the
combination of a word stem with a grammatical morpheme, usually resulting in a
word of a different class, often with a meaning hard to predict exactly.

A very common kind of derivation in English is the formation of new
nouns, often from verbs or adjectives. This process is called nominalization.
For example, the suffix -ation produces nouns from verbs ending often in the
suffix -ize (computerize → computerization). Here are examples of some
particularly productive English nominalizing suffixes.


Suffix    Base Verb/Adjective    Derived Noun
-ation    computerize (V)        computerization
-ee       appoint (V)            appointee
-er       kill (V)               killer
-ness     fuzzy (A)              fuzziness

Adjectives can also be derived from nouns and verbs. Here are
examples of a few suffixes deriving adjectives from nouns or verbs.


Suffix    Base Noun/Verb     Derived Adjective
-al       computation (N)    computational
-able     embrace (V)        embraceable
-less     clue (N)           clueless

Derivation in English is more complex than inflection for a number of
reasons. One is that it is generally less productive; even a nominalizing
suffix like -ation, which can be added to almost any verb ending in -ize, cannot
be added to absolutely every verb. Thus we can’t say *eatation or
*spellation (we use an asterisk (*) to mark ‘non-examples’ of English). Another
is that there are subtle and complex meaning differences among
nominalizing suffixes. For example sincerity has a subtle difference in meaning from
sincereness.

3.2  Finite-State  Morphological  Parsing 

Let’s now proceed to the problem of parsing English morphology. Consider
a simple example: parsing just the productive nominal plural (-s) and the
verbal progressive (-ing). Our goal will be to take input forms like those in
the first column below and produce output forms like those in the second
column.


Input      Morphological Parsed Output
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
goose      (goose +N +SG) or (goose +V)
gooses     goose +V +3SG
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)

The second column contains the stem of each word as well as assorted
morphological features. These features specify additional information about
the stem. For example the feature +N means that the word is a noun; +SG
means it is singular, +PL that it is plural. We will discuss features in
Chapter 11; for now, consider +SG to be a primitive unit that means ‘singular’.
Note that some of the input forms (like caught or goose) will be ambiguous
between different morphological parses.

In order to build a morphological parser, we’ll need at least the
following:

1.  a lexicon:  The  list  of  stems  and  affixes,  together  with  basic  information 
about  them  (whether  a stem  is  a Noun  stem  or  a Verb  stem,  etc). 

2.  morphotactics:  the  model  of  morpheme  ordering  that  explains  which 
classes  of  morphemes  can  follow  other  classes  of  morphemes  inside  a 
word.  For  example,  the  rule  that  the  English  plural  morpheme  follows 
the  noun  rather  than  preceding  it. 

3. orthographic rules: these spelling rules are used to model the changes
that occur in a word, usually when two morphemes combine (for
example the y → ie spelling rule discussed above that changes city + -s
to cities rather than citys).

The next part of this section will discuss how to represent a simple
version of the lexicon just for the sub-problem of morphological recognition,
including how to use FSAs to model morphotactic knowledge. We will then
introduce the finite-state transducer (FST) as a way of modeling
morphological features in the lexicon, and addressing morphological parsing. Finally,
we show how to use FSTs to model orthographic rules.


The  Lexicon  and  Morphotactics 

A lexicon is a repository for words. The simplest possible lexicon would
consist of an explicit list of every word of the language (every word, i.e.,
including abbreviations (‘AAA’) and proper names (‘Jane’ or ‘Beijing’)) as
follows:

a 

AAA 

AA 

Aachen 

aardvark 

aardwolf 

aba 

abaca 

aback 


Since it will often be inconvenient or impossible, for the various
reasons we discussed above, to list every word in the language, computational
lexicons are usually structured with a list of each of the stems and affixes of
the language together with a representation of the morphotactics that tells us
how they can fit together. There are many ways to model morphotactics; one
of the most common is the finite-state automaton. A very simple finite-state
model for English nominal inflection might look like Figure 3.2.

The FSA in Figure 3.2 assumes that the lexicon includes regular nouns
(reg-noun) that take the regular -s plural (e.g. cat, dog, fox, aardvark).
These are the vast majority of English nouns since for now we will ignore
the fact that the plural of words like fox has an inserted e: foxes. The
lexicon also includes irregular noun forms that don’t take -s, both singular
irreg-sg-noun (goose, mouse) and plural irreg-pl-noun (geese, mice).

Figure 3.2  A finite-state automaton for English nominal inflection.
(Arc labels legible in the figure: reg-noun, plural (-s).)


reg-noun    irreg-pl-noun    irreg-sg-noun    plural
fox         geese            goose            -s
cat         sheep            sheep
dog         mice             mouse
aardvark
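The way the sub-lexicons and the morphotactic FSA fit together can be sketched in a few lines of Python. This is only an illustration: the state names q0, q1, q2 and the function name `recognize` are assumptions made here, and spelling changes are ignored exactly as in the text.

```python
# Sub-lexicons from the table above (morpheme class -> members).
LEXICON = {
    "reg-noun": {"fox", "cat", "dog", "aardvark"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "plural": {"s"},
}

# Morphotactics in the spirit of Figure 3.2: state -> [(morpheme class, next state)].
# The state names are assumed for illustration.
ARCS = {
    "q0": [("reg-noun", "q1"), ("irreg-pl-noun", "q2"), ("irreg-sg-noun", "q2")],
    "q1": [("plural", "q2")],
    "q2": [],
}
FINAL = {"q1", "q2"}


def recognize(word, state="q0"):
    """Return True if word can be split into morphemes along some accepting path."""
    if word == "":
        return state in FINAL
    for morph_class, next_state in ARCS[state]:
        for morph in LEXICON[morph_class]:
            if word.startswith(morph) and recognize(word[len(morph):], next_state):
                return True
    return False
```

Because no orthographic rules are applied, this sketch accepts foxs and rejects foxes, which is the gap the spelling-rule transducers are meant to close.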

A similar  model  for  English  verbal  inflection  might  look  like  Fig- 
ure 3.3. 


This lexicon has three stem classes (reg-verb-stem, irreg-verb-stem,
and irreg-past-verb-form), plus 4 more affix classes (-ed past, -ed participle,
-ing participle, and 3rd singular -s):

reg-verb-stem  irreg-verb-stem  irreg-past-verb  past  past-part  pres-part  3sg
walk           cut              caught           -ed   -ed        -ing       -s
fry            speak            ate
talk           sing             eaten
impeach        sang
               cut
               spoken

English derivational morphology is significantly more complex than
English inflectional morphology, and so automata for modeling English
derivation tend to be quite complex. Some models of English derivation, in fact,
are based on the more complex context-free grammars of Chapter 9 (Sproat,
1993; Orgun, 1995).

As a preliminary example, though, of the kind of analysis it would
require, we present a small part of the morphotactics of English adjectives,
taken from Antworth (1990). Antworth offers the following data on English
adjectives:

big,  bigger,  biggest 
cool,  cooler,  coolest,  coolly 
red,  redder,  reddest 

clear,  clearer,  clearest,  clearly,  unclear,  unclearly 
happy,  happier,  happiest,  happily 
unhappy,  unhappier,  unhappiest,  unhappily 
real,  unreal,  really 

An initial hypothesis might be that adjectives can have an optional
prefix (un-), an obligatory root (big, cool, etc) and an optional suffix (-er, -est,
or -ly). This might suggest the FSA in Figure 3.4.

Alas, while this FSA will recognize all the adjectives in the table above,
it will also recognize ungrammatical forms like unbig, redly, and realest.
We need to set up classes of roots and specify which can occur with which
suffixes. So adj-root1 would include adjectives that can occur with un- and
-ly (clear, happy, and real) while adj-root2 will include adjectives that can’t
(big, cool, and red). Antworth (1990) presents Figure 3.5 as a partial solution
to these problems.
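The overgeneration of the initial hypothesis is easy to reproduce. The sketch below is a hypothetical string-level version of that first FSA (optional un-, obligatory root, optional suffix); the name `accepts_naive` is ours, and spelling changes such as the doubled g in bigger are deliberately not modeled.

```python
ROOTS = {"big", "cool", "red", "clear", "happy", "real"}
SUFFIXES = {"er", "est", "ly"}


def accepts_naive(word):
    """Optional prefix un-, obligatory root, optional suffix -er, -est, or -ly."""
    candidates = [word]
    if word.startswith("un"):
        candidates.append(word[2:])      # also try the reading with un- stripped
    for cand in candidates:
        for root in ROOTS:
            if cand.startswith(root):
                tail = cand[len(root):]
                if tail == "" or tail in SUFFIXES:
                    return True
    return False
```

With a single undifferentiated root class, forms like unbig, redly, and realest are accepted along with cooler and unclear, which is why the roots have to be split into classes.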

Figure 3.5  An FSA for a fragment of English adjective morphology:
Antworth’s Proposal #2.

This gives an idea of the complexity to be expected from English
derivation. For a further example, we give in Figure 3.6 another fragment
of an FSA for English nominal and verbal derivational morphology, based
on Sproat (1993), Bauer (1983), and Porter (1980). This FSA models a
number of derivational facts, such as the well known generalization that any
verb ending in -ize can be followed by the nominalizing suffix -ation (Bauer,
1983; Sproat, 1993). Thus since there is a word fossilize, we can predict the
word fossilization by following states q0, q1, and q2. Similarly, adjectives
ending in -al or -able at q5 (equal, formal, realizable) can take the suffix -ity,
or sometimes the suffix -ness to state q6 (naturalness, casualness). We leave
it as an exercise for the reader (Exercise 3.2) to discover some of the
individual exceptions to many of these constraints, and also to give examples of
some of the various noun and verb classes.

We can now use these FSAs to solve the problem of morphological
recognition; that is, of determining whether an input string of letters makes
up a legitimate English word or not. We do this by taking the morphotactic
FSAs, and plugging each ‘sub-lexicon’ into the FSA. That is, we expand
each arc (e.g. the reg-noun-stem arc) with all the morphemes that make up
the set of reg-noun-stem. The resulting FSA can then be defined at the level
of the individual letter.


Figure  3.7  Compiled  FSA  for  a few  English  nouns  with  their  inflection. 
Note  that  this  automaton  will  incorrectly  accept  the  input  foxs.  We  will  see 
beginning  on  page  76  how  to  correctly  deal  with  the  inserted  e in  foxes. 


Figure 3.7 shows the noun-recognition FSA produced by expanding
the Nominal Inflection FSA of Figure 3.2 with sample regular and irregular
nouns for each class. We can use Figure 3.7 to recognize strings like
aardvarks by simply starting at the initial state, and comparing the input letter by
letter with each word on each outgoing arc, etc., just as we saw in Chapter 2.
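A letter-level automaton of this kind can be sketched as a trie of nested dictionaries. The representation and the names `compile_fsa` and `recognize` are assumptions chosen for brevity; the automaton in Figure 3.7 is additionally minimized, which this sketch does not attempt.

```python
def compile_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Build a letter-level automaton as nested dicts (a trie).

    The key "#final" marks an accepting state; it can never collide with an
    input letter because letters are single characters.
    """
    root = {}

    def add(word):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["#final"] = True

    for stem in reg_nouns:
        add(stem)          # singular: the stem alone
        add(stem + "s")    # plural: the stem followed by the -s arc
    for form in list(irreg_sg) + list(irreg_pl):
        add(form)          # irregular forms take no suffix
    return root


def recognize(fsa, word):
    """Follow one arc per input letter and check for an accepting state."""
    node = fsa
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return node.get("#final", False)


FSA = compile_fsa(["fox", "cat", "dog", "aardvark"],
                  ["goose", "sheep", "mouse"],
                  ["geese", "sheep", "mice"])
```

As the caption of Figure 3.7 warns, an automaton built this way accepts foxs and rejects foxes.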

Morphological  Parsing  with  Finite-State  Transducers 

Now that we’ve seen how to use FSAs to represent the lexicon and
incidentally do morphological recognition, let’s move on to morphological parsing.

For example, given the input cats, we’d like to output cat +N +PL, telling
us that cat is a plural noun. We will do this via a version of two-level
morphology, first proposed by Koskenniemi (1983). Two-level morphology
represents a word as a correspondence between a lexical level, which represents
a simple concatenation of morphemes making up a word, and the surface
level, which represents the actual spelling of the final word. Morphological
parsing is implemented by building mapping rules that map letter sequences
like cats on the surface level into morpheme and features sequences like
cat +N +PL on the lexical level. Figure 3.8 shows these two levels for the
word cats. Note that the lexical level has the stem for a word, followed by
the morphological information +N +PL which tells us that cats is a plural
noun.


Lexical:  c a t +N +PL
Surface:  c a t s

Figure 3.8  Example of the lexical and surface tapes.

The automaton that we use for performing the mapping between these
two levels is the finite-state transducer or FST. A transducer maps between
one set of symbols and another; a finite-state transducer does this via a
finite automaton. Thus we usually visualize an FST as a two-tape automaton
which recognizes or generates pairs of strings. The FST thus has a more
general function than an FSA; where an FSA defines a formal language by
defining a set of strings, an FST defines a relation between sets of strings.

This relates to another view of an FST: as a machine that reads one string
and generates another. Here’s a summary of this four-fold way of thinking
about transducers:

• FST as recognizer: a transducer that takes a pair of strings as input
and outputs accept if the string-pair is in the string-pair language, and
reject if it is not.

• FST  as  generator:  a machine  that  outputs  pairs  of  strings  of  the  lan- 
guage. Thus  the  output  is  a yes  or  no,  and  a pair  of  output  strings. 

• FST  as  translator:  a machine  that  reads  a string  and  outputs  another 
string. 

• FST  as  set  relater:  a machine  that  computes  relations  between  sets. 

An FST can be formally defined in a number of ways; we will rely
on the following definition, based on what is called the Mealy machine
extension to a simple FSA:

• $Q$: a finite set of $N$ states $q_0, q_1, \ldots, q_N$

• $\Sigma$: a finite alphabet of complex symbols. Each complex symbol is
composed of an input-output pair $i : o$; one symbol $i$ from an input
alphabet $I$, and one symbol $o$ from an output alphabet $O$, thus
$\Sigma \subseteq I \times O$. $I$ and $O$ may each also include the epsilon symbol $\epsilon$.

• $q_0$: the start state

• $F$: the set of final states, $F \subseteq Q$

• $\delta(q, i : o)$: the transition function or transition matrix between states.
Given a state $q \in Q$ and complex symbol $i : o \in \Sigma$, $\delta(q, i : o)$ returns a
new state $q' \in Q$. $\delta$ is thus a relation from $Q \times \Sigma$ to $Q$.

Where an FSA accepts a language stated over a finite alphabet of single
symbols, such as the alphabet of our sheep language:

$$\Sigma = \{b, a, !\} \qquad (3.2)$$

an FST accepts a language stated over pairs of symbols, as in:

$$\Sigma = \{a : a,\ b : b,\ ! : !,\ a : !,\ a : \epsilon,\ \epsilon : !\} \qquad (3.3)$$

In two-level morphology, the pairs of symbols in $\Sigma$ are also called feasible
pairs.
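To make the definition concrete, here is a minimal sketch of the FST-as-recognizer view. The dictionary encoding of δ, the use of the empty string for ε, and the name `accepts_pair` are all assumptions made for illustration; the tiny transducer covers only the word cat.

```python
EPS = ""  # the empty string stands in for the epsilon symbol


def accepts_pair(delta, start, finals, upper, lower):
    """Return True if the pair (upper, lower) is in the relation defined by the FST.

    delta maps (state, (i, o)) to a next state; either i or o may be EPS.
    """
    seen = set()
    stack = [(start, 0, 0)]   # (state, position in upper, position in lower)
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        state, u, l = config
        if u == len(upper) and l == len(lower) and state in finals:
            return True
        for (src, (i, o)), dst in delta.items():
            if src != state:
                continue
            # A feasible pair i:o consumes i on the upper tape and o on the lower.
            if upper.startswith(i, u) and lower.startswith(o, l):
                stack.append((dst, u + len(i), l + len(o)))
    return False


# A toy transducer relating cat+N+PL to cats and cat+N+SG to cat.
DELTA = {
    (0, ("c", "c")): 1,
    (1, ("a", "a")): 2,
    (2, ("t", "t")): 3,
    (3, ("+N", EPS)): 4,
    (4, ("+PL", "s")): 5,
    (4, ("+SG", EPS)): 5,
}
FINALS = {5}
```

The search over configurations is needed because ε on either tape means the two tapes do not advance in lockstep.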

Where FSAs are isomorphic to regular languages, FSTs are isomorphic
to regular relations. Regular relations are sets of pairs of strings, a
natural extension of the regular languages, which are sets of strings. Like
FSAs and regular languages, FSTs and regular relations are closed under
union, although in general they are not closed under difference,
complementation and intersection (although some useful subclasses of FSTs are closed
under these operations; in general FSTs that are not augmented with the ε
are more likely to have such closure properties). Besides union, FSTs have
two additional closure properties that turn out to be extremely useful:

• inversion: the inversion of a transducer $T$ ($T^{-1}$) simply switches the
input and output labels. Thus if $T$ maps from the input alphabet $I$ to
the output alphabet $O$, $T^{-1}$ maps from $O$ to $I$.

• composition: if $T_1$ is a transducer from $I_1$ to $O_1$ and $T_2$ a transducer
from $I_2$ to $O_2$, then $T_1 \circ T_2$ maps from $I_1$ to $O_2$.

Inversion is useful because it makes it easy to convert an FST-as-parser
into an FST-as-generator. Composition is useful because it allows us to take
two transducers that run in series and replace them with one more complex
transducer. Composition works as in algebra; applying $T_1 \circ T_2$ to an input
sequence $S$ is identical to applying $T_1$ to $S$ and then $T_2$ to the result; thus
$T_1 \circ T_2(S) = T_2(T_1(S))$. We will see examples of composition below.
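Both properties can be illustrated on small finite relations. In the sketch below a finite set of string pairs stands in for a transducer (real regular relations are typically infinite), and the names `invert`, `compose`, and `apply_relation` are hypothetical helpers introduced only for this example.

```python
def invert(relation):
    """T^-1: swap the input and output of every pair."""
    return {(o, i) for (i, o) in relation}


def compose(t1, t2):
    """T1 o T2: a pair (i, o) is included if T1 maps i to some m and T2 maps m to o."""
    return {(i, o) for (i, m1) in t1 for (m2, o) in t2 if m1 == m2}


def apply_relation(relation, s):
    """All outputs the relation associates with the input s."""
    return {o for (i, o) in relation if i == s}


# Lexical -> intermediate, and intermediate -> surface, for two words.
T_LEX = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
T_RULES = {("fox^s#", "foxes"), ("cat^s#", "cats")}

GENERATOR = compose(T_LEX, T_RULES)   # lexical -> surface
PARSER = invert(GENERATOR)            # surface -> lexical
```

Here `apply_relation(GENERATOR, "fox +N +PL")` yields the surface form, and the inverted relation maps the surface form back to the lexical form, mirroring the parser/generator duality described above.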

We mentioned that for two-level morphology it’s convenient to view
an FST as having two tapes. The upper or lexical tape is composed of
characters from the left side of the a : b pairs; the lower or surface tape
is composed of characters from the right side of the a : b pairs. Thus each
symbol a : b in the transducer alphabet Σ expresses how the symbol a from
one tape is mapped to the symbol b on the other tape. For example a : ε
means that an a on the upper tape will correspond to nothing on the lower
tape. Just as for an FSA, we can write regular expressions in the complex
alphabet Σ. Since it’s most common for symbols to map to themselves, in
two-level morphology we call pairs like a : a default pairs, and just refer to
them by the single letter a.

We are now ready to build an FST morphological parser out of our
earlier morphotactic FSAs and lexica by adding an extra “lexical” tape and
the appropriate morphological features. Figure 3.9 shows an augmentation
of Figure 3.2 with the nominal morphological features (+SG and +PL) that
correspond to each morpheme. Note that these features map to the empty
string ε or the word/morpheme boundary symbol # since there is no segment
corresponding to them on the output tape.

Figure 3.9  A transducer for English nominal number inflection T_num.
Since both q1 and q2 are accepting states, regular nouns can have the plural
suffix or not. The morpheme-boundary symbol ^ and word-boundary marker
# will be discussed below. (Arc labels legible in the figure: reg-noun-stem,
irreg-pl-noun-form, +N:ε, +PL:^s#, +PL:#.)

In order to use Figure 3.9 as a morphological noun parser, it needs to be
augmented with all the individual regular and irregular noun stems, replacing
the labels regular-noun-stem etc. In order to do this we need to update the
lexicon for this transducer, so that irregular plurals like geese will parse into
the correct stem goose +N +PL. We do this by allowing the lexicon to
also have two levels. Since surface geese maps to underlying goose, the
new lexical entry will be ‘g:g o:e o:e s:s e:e’. Regular forms are
simpler; the two-level entry for fox will now be ‘f:f o:o x:x’, but by
relying on the orthographic convention that f stands for f:f and so on, we
can simply refer to it as fox and the form for geese as ‘g o:e o:e s e’.
Thus the lexicon will look only slightly more complex:


reg-noun    irreg-pl-noun      irreg-sg-noun
fox         g o:e o:e s e      goose
cat         sheep              sheep
dog         m o:i u:ε s:c e    mouse
aardvark

Our proposed morphological parser needs to map from surface forms
like geese to lexical forms like goose +N +PL. We could do this by
cascading the lexicon above with the singular/plural automaton of Figure 3.9.
Cascading two automata means running them in series with the output of
the first feeding the input to the second. We would first represent the
lexicon of stems in the above table as the FST T_stems of Figure 3.10. This FST
maps e.g. dog to reg-noun-stem. In order to allow possible suffixes, T_stems
in Figure 3.10 allows the forms to be followed by the wildcard @ symbol;
@:@ stands for ‘any feasible pair’. A pair of the form @:x, for example, will
mean ‘any feasible pair which has x on the surface level’, and
correspondingly for the form x:@. The output of this FST would then feed the number
automaton T_num.

Instead of cascading the two transducers, we can compose them using
the composition operator defined above. Composing is a way of taking a
cascade of transducers with many different levels of inputs and outputs and
converting them into a single ‘two-level’ transducer with one input tape and
one output tape. The algorithm for composition bears some resemblance to
the algorithm for determinization of FSAs from page 49; given two automata
$T_1$ and $T_2$ with state sets $Q_1$ and $Q_2$ and transition functions $\delta_1$ and $\delta_2$, we
create a new possible state $(x, y)$ for every pair of states $x \in Q_1$ and $y \in Q_2$.
Then the new automaton has the transition function:

$$\delta_3((x_a, y_a), i : o) = (x_b, y_b) \ \text{if}\ \exists c \ \text{s.t.}\ \delta_1(x_a, i : c) = x_b \ \text{and}\ \delta_2(y_a, c : o) = y_b \qquad (3.4)$$

The resulting composed automaton, T_lex = T_num ∘ T_stems, is shown in
Figure 3.11 (compare this with the FSA lexicon in Figure 3.7 on page 70).²
Note that the final automaton still has two levels separated by the colon (:).
Because the colon was reserved for these levels, we had to use the | symbol
in T_stems in Figure 3.10 to separate the upper and lower tapes.
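The transition function in (3.4) can be sketched directly as a product construction. This is a simplified illustration: the name `compose_fst` and the dictionary encoding are assumptions, epsilon symbols are not treated specially, and target states are collected in sets because several intermediate symbols c may license the same i:o pair.

```python
from itertools import product


def compose_fst(delta1, delta2):
    """Build the transitions of the composed machine following (3.4).

    Each delta maps (state, (input, output)) to a next state. The result maps
    ((xa, ya), (i, o)) to the set of paired states (xb, yb) reachable through
    some shared intermediate symbol c.
    """
    delta3 = {}
    for ((xa, (i, c1)), xb), ((ya, (c2, o)), yb) in product(delta1.items(),
                                                          delta2.items()):
        if c1 == c2:   # the output of machine 1 is the input of machine 2
            delta3.setdefault(((xa, ya), (i, o)), set()).add((xb, yb))
    return delta3


# Machine 1 rewrites a as b, then b as c; machine 2 rewrites b as x and c as y.
DELTA1 = {(0, ("a", "b")): 1, (1, ("b", "c")): 0}
DELTA2 = {(0, ("b", "x")): 0, (0, ("c", "y")): 0}
DELTA3 = compose_fst(DELTA1, DELTA2)
```

In the composed table the intermediate symbols b and c no longer appear; only the outer pairs a:x and b:y remain, which is what it means to collapse a cascade into a single two-level transducer.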


This transducer will map plural nouns into the stem plus the
morphological marker +PL, and singular nouns into the stem plus the morpheme
+SG. Thus a surface cats will map to cat +N +PL as follows:

c:c a:a t:t +N:ε +PL:^s#

That is, c maps to itself, as do a and t, while the morphological feature
+N (recall that this means ‘noun’) maps to nothing (ε), and the feature +PL
(meaning ‘plural’) maps to ^s. The symbol ^ indicates a morpheme
boundary, while the symbol # indicates a word boundary. Figure 3.12 refers to
tapes with these morpheme boundary markers as intermediate tapes; the
next section will show how the boundary marker is removed.

² Note that for the purposes of clear exposition Figure 3.11 has not been minimized in the
way that Figure 3.7 has.


Lexical:       f o x +N +PL
Intermediate:  f o x ^ s #

Figure 3.12  An example of the lexical and intermediate tapes.


Orthographic  Rules  and  Finite-State  Transducers 

The  method  described  in  the  previous  section  will  successfully  recognize 
words  like  aardvarks  and  mice.  But  just  concatenating  the  morphemes  won’t 
work  for  cases  where  there  is  a spelling  change;  it  would  incorrectly  reject 
an  input  like  foxes  and  accept  an  input  like  foxs.  We  need  to  deal  with  the 
fact  that  English  often  requires  spelling  changes  at  morpheme  boundaries  by 
introducing  spelling  rules  (or  orthographic  rules).  This  section  introduces 
a number  of  notations  for  writing  such  rules  and  shows  how  to  implement 
the rules as transducers. Some of these spelling rules:

Name                 Description of Rule                            Example
Consonant doubling   1-letter consonant doubled before -ing/-ed     beg/begging
E deletion           Silent e dropped before -ing and -ed           make/making
E insertion          e added after -s, -z, -x, -ch, -sh before -s   watch/watches
Y replacement        -y changes to -ie before -s, -i before -ed     try/tries
K insertion          verbs ending with vowel + -c add -k            panic/panicked

We can think of these spelling changes as taking as input a simple
concatenation of morphemes (the ‘intermediate output’ of the lexical
transducer in Figure 3.11) and producing as output a slightly modified (correctly
spelled) concatenation of morphemes. Figure 3.13 shows the three levels
we are talking about: lexical, intermediate, and surface.

Lexical:       f o x +N +PL
Intermediate:  f o x ^ s #
Surface:       f o x e s

Figure 3.13  An example of the lexical, intermediate and surface tapes.
Between each pair of tapes is a 2-level transducer; the lexical transducer of
Figure 3.11 between the lexical and intermediate levels, and the E-insertion
spelling rule between the intermediate and surface levels. The E-insertion
spelling rule inserts an e on the surface tape when the intermediate tape has a
morpheme boundary ^ followed by the morpheme -s.

So for example we could write an E-insertion rule that performs the mapping
from the intermediate to surface levels shown in Figure 3.13. Such a rule
might say something like “insert an e on the surface tape just when the lexical
tape has a morpheme ending in x (or z, etc) and the next morpheme is -s”.
Here’s a formalization of the rule:

$$\epsilon \rightarrow e \;/\; \{x, s, z\}\ \texttt{\^{}}\ \_\_\_\ s\,\# \qquad (3.5)$$
This is the rule notation of Chomsky and Halle (1968); a rule of the
form a → b / c __ d means ‘rewrite a as b when it occurs between c and
d’. Since the symbol ε means an empty transition, replacing it means
inserting something. The symbol ^ indicates a morpheme boundary. These
boundaries are deleted by including the symbol ^:ε in the default pairs for
the transducer; thus morpheme boundary markers are deleted on the surface
level by default. (Recall that the colon is used to separate symbols on the
intermediate and surface forms.) The # symbol is a special symbol that marks
a word boundary. Thus (3.5) means ‘insert an e after a morpheme-final x,
s, or z, and before the morpheme s’. Figure 3.14 shows an automaton that
corresponds to this rule.
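Before turning to the automaton, the effect of the rule can be sketched with a regular-expression substitution. This is only an approximation of the two-level rule, not the transducer of Figure 3.14: the function name `e_insertion` is ours, and dropping the # marker on the surface is a simplification made for readability.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s#' to a surface form."""
    # Insert e after a morpheme-final x, s, or z when the next morpheme is s
    # followed by the word boundary.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)
    # Morpheme (^) and word (#) boundaries do not appear on the surface here.
    return with_e.replace("^", "").replace("#", "")
```

The lookahead `(?=s#)` plays the role of the right-hand context in the rule, and the captured `[xsz]` followed by the boundary plays the role of the left-hand context.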


The idea in building a transducer for a particular rule is to express only
the constraints necessary for that rule, allowing any other string of symbols
to pass through unchanged. This rule is used to ensure that we can only
see the ε:e pair if we are in the proper context. So state q0, which models
having seen only default pairs unrelated to the rule, is an accepting state,
as is q1, which models having seen a z, s, or x. q2 models having seen the
morpheme boundary after the z, s, or x, and again is an accepting state. State
q3 models having just seen the E-insertion; it is not an accepting state, since
the insertion is only allowed if it is followed by the s morpheme and then the
end-of-word symbol #.

The other symbol is used in Figure 3.14 to safely pass through any
parts of words that don’t play a role in the E-insertion rule. other means
‘any feasible pair that is not in this transducer’; it is thus a version of @:@
which is context-dependent in a transducer-by-transducer way. So for
example when leaving state q0, we go to q1 on the z, s, or x symbols, rather than
following the other arc and staying in q0. The semantics of other depends
on what symbols are on other arcs; since # is mentioned on some arcs, it
is (by definition) not included in other, and thus, for example, is explicitly
mentioned on the arc from q2 to q0.

A transducer needs to correctly reject a string that applies the rule when
it shouldn’t. One possible bad string would have the correct environment for
the E-insertion, but have no insertion. State q5 is used to ensure that the e
is always inserted whenever the environment is appropriate; the transducer
reaches q5 only when it has seen an s after an appropriate morpheme
boundary. If the machine is in state q5 and the next symbol is #, the machine rejects
the string (because there is no legal transition on # from q5). Figure 3.15
shows the transition table for the rule which makes the illegal transitions
explicit with the ‘-’ symbol.


State \ Input   s:s   x:x   z:z   ^:ε   ε:e   #    other
q0:             1     1     1     0     -     0    0
q1:             1     1     1     2     -     0    0
q2:             5     1     1     0     3     0    0
q3              4     -     -     -     -     -    -
q4              -     -     -     -     -     0    -
q5              1     1     1     2     -     -    0

Figure 3.15  The state-transition table for E-insertion rule of Figure 3.14,
extended from a similar transducer in Antworth (1990).
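A table of this kind can be run directly. The sketch below encodes the transition table as we read it from the description of states q0 through q5 in the text, so the individual cells should be treated as an assumption; the input is a sequence of already aligned intermediate:surface pairs, and the name `accepts` is hypothetical.

```python
EPS = ""  # the empty string stands in for epsilon

# Column order for the pairs named in the rule; None marks an illegal move.
COLS = [("s", "s"), ("x", "x"), ("z", "z"), ("^", EPS), (EPS, "e"), ("#", "#"), "other"]
TABLE = {
    0: [1, 1, 1, 0, None, 0, 0],
    1: [1, 1, 1, 2, None, 0, 0],
    2: [5, 1, 1, 0, 3, 0, 0],
    3: [4, None, None, None, None, None, None],
    4: [None, None, None, None, None, 0, None],
    5: [1, 1, 1, 2, None, None, 0],
}
# q0, q1, q2 are accepting per the text; treating q4 and q5 as non-accepting
# is an assumption of this sketch.
ACCEPTING = {0, 1, 2}


def accepts(pairs):
    """Run the table over a sequence of (intermediate, surface) symbol pairs."""
    state = 0
    for pair in pairs:
        # Any pair not named in the rule falls into the 'other' column.
        col = COLS.index(pair) if pair in COLS else COLS.index("other")
        state = TABLE[state][col]
        if state is None:
            return False
    return state in ACCEPTING


FOXES = [("f", "f"), ("o", "o"), ("x", "x"), ("^", EPS), (EPS, "e"), ("s", "s"), ("#", "#")]
FOXS = [("f", "f"), ("o", "o"), ("x", "x"), ("^", EPS), ("s", "s"), ("#", "#")]
CATS = [("c", "c"), ("a", "a"), ("t", "t"), ("^", EPS), ("s", "s"), ("#", "#")]
```

Running `accepts(FOXES)` passes through states 0, 0, 1, 2, 3, 4, 0, while `accepts(FOXS)` reaches state 5 and then fails on #, which is the rejection behavior described for q5.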

The  next  section  will  show  a trace  of  this  E-insertion  transducer  run- 
ning on  a sample  input  string. 


3.3  Combining  FST  Lexicon  and  Rules 

We are now ready to combine our lexicon and rule transducers for parsing
and generating. Figure 3.16 shows the architecture of a two-level
morphology system, whether used for parsing or generating. The lexicon transducer
maps between the lexical level, with its stems and morphological features,
and an intermediate level that represents a simple concatenation of
morphemes. Then a host of transducers, each representing a single spelling rule
constraint, all run in parallel so as to map between this intermediate level and
the surface level. Putting all the spelling rules in parallel is a design choice;
we could also have chosen to run all the spelling rules in series (as a long
cascade), if we slightly changed each rule.


Figure 3.16  Generating or Parsing with FST lexicon and rules

The architecture in Figure 3.16 is a two-level cascade of transducers.
Recall that a cascade is a set of transducers in series, in which the output
from one transducer acts as the input to another transducer; cascades can
be of arbitrary depth, and each level might be built out of many individual
transducers. The cascade in Figure 3.16 has two transducers in series: the
transducer mapping from the lexical to the intermediate levels, and the
collection of parallel transducers mapping from the intermediate to the surface
level. The cascade can be run top-down to generate a string, or bottom-up
to parse it; Figure 3.17 shows a trace of the system accepting the mapping
from fox^s to foxes.

The power of finite-state transducers is that the exact same cascade
with the same state sequences is used when the machine is generating the
surface tape from the lexical tape, or when it is parsing the lexical tape from
the surface tape. For example, for generation, imagine leaving the
Intermediate and Surface tapes blank. Now if we run the lexicon transducer, given
fox +N +PL, it will produce fox^s# on the Intermediate tape via the same
states that it accepted the Lexical and Intermediate tapes in our earlier
example. If we then allow all possible orthographic transducers to run in parallel,
we will produce the same surface tape.

Figure 3.17  Accepting foxes: The lexicon transducer T_lex from Figure 3.11
cascaded with the E-insertion transducer in Figure 3.14.

Parsing can be slightly more complicated than generation, because of
the problem of ambiguity. For example, foxes can also be a verb (albeit
a rare one, meaning ‘to baffle or confuse’), and hence the lexical parse for
foxes could be fox +V +3SG as well as fox +N +PL. How are we to
know which one is the proper parse? In fact, for ambiguous cases of this sort,
the transducer is not capable of deciding. Disambiguating will require some
external evidence such as the surrounding words. Thus foxes is likely to be
a noun in the sequence I saw two foxes yesterday, but a verb in the sequence
That trickster foxes me every time!. We will discuss such disambiguation
algorithms in Chapter 8 and Chapter 17. Barring such external evidence, the
best our transducer can do is just enumerate the possible choices; so we can
transduce fox^s# into both fox +V +3SG and fox +N +PL.
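Enumerating the possible parses can be sketched with a toy lexicon. Note that this is generate-and-test rather than a true inverted transducer: every lexical form is pushed through the lexical-to-surface mapping and compared with the input. The lexicon contents, feature inventory, and function names are assumptions made for the example, and E-insertion is the only spelling rule modeled.

```python
import re

# A toy lexicon: stems with the categories they can bear.
STEMS = {"fox": ["+N", "+V"], "cat": ["+N"]}
# Features and the intermediate-tape material they map to (a small fragment).
SUFFIXES = {
    "+N": {"+SG": "#", "+PL": "^s#"},
    "+V": {"": "#", "+3SG": "^s#"},
}


def generate(stem, cat, feat):
    """Lexical -> intermediate -> surface."""
    intermediate = stem + SUFFIXES[cat][feat]
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)
    return with_e.replace("^", "").replace("#", "")


def parse(surface):
    """Enumerate every lexical form whose generated surface form matches."""
    parses = []
    for stem, cats in STEMS.items():
        for cat in cats:
            for feat in SUFFIXES[cat]:
                if generate(stem, cat, feat) == surface:
                    parses.append(" ".join(p for p in (stem, cat, feat) if p))
    return sorted(parses)
```

For the input foxes this returns both the noun and the verb analysis; choosing between them is left to whatever external evidence is available.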

There  is  a kind  of  ambiguity  that  we  need  to  handle:  local  ambiguity 
that  occurs  during  the  process  of  parsing.  For  example,  imagine  parsing  the 
input  verb  assess.  After  seeing  ass,  our  E-insertion  transducer  may  propose 
that  the  e that  follows  is  inserted  by  the  spelling  rule  (for  example,  as  far  as 
the  transducer  is  concerned,  we  might  have  been  parsing  the  word  asses).  It 
is  not  until  we  don’t  see  the  # after  asses,  but  rather  run  into  another  s,  that 
we  realize  we  have  gone  down  an  incorrect  path. 

Because of this non-determinism, FST-parsing algorithms need to
incorporate some sort of search algorithm. Exercise 3.8 asks the reader to
modify the algorithm for non-deterministic FSA recognition in Figure 2.21
in Chapter 2 to do FST parsing.

Running a cascade, particularly one with many levels, can be unwieldy.
Luckily, we’ve already seen how to compose a cascade of transducers in
series into a single more complex transducer. Transducers in parallel can be
combined by automaton intersection. The automaton intersection
algorithm just takes the Cartesian product of the states, i.e. for each state q_i in
machine 1 and state q_j in machine 2, we create a new state q_ij. Then for
any input symbol a, if machine 1 would transition to state q_n and machine 2
would transition to state q_m, we transition to state q_nm.
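The Cartesian-product construction can be sketched for deterministic, epsilon-free machines as follows. The function names and the two example automata are assumptions for illustration; for transducers run in parallel, each symbol would be an i:o pair rather than a single letter, and the code would be unchanged.

```python
def intersect(delta1, delta2, start1, start2, finals1, finals2):
    """Product automaton: state (qi, qj) pairs a state of machine 1 with one of machine 2."""
    delta = {}
    for (qi, a), qn in delta1.items():
        for (qj, b), qm in delta2.items():
            if a == b:   # both machines move on the same symbol
                delta[((qi, qj), a)] = (qn, qm)
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return delta, (start1, start2), finals


def accepts(delta, start, finals, symbols):
    """Run a deterministic machine given as a (state, symbol) -> state table."""
    state = start
    for a in symbols:
        if (state, a) not in delta:
            return False
        state = delta[(state, a)]
    return state in finals


# Machine 1: an even number of a's. Machine 2: the string ends in b.
EVEN_A = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
ENDS_B = {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}
BOTH, START, FINALS = intersect(EVEN_A, ENDS_B, 0, 0, {0}, {1})
```

A string is accepted by the product machine only if both component machines accept it, which is how a bank of parallel spelling rules acts as a single constraint.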

Figure 3.18 sketches how this intersection (∧) and composition (∘)
process might be carried out.

Figure 3.18  Intersection and composition of transducers.


Since there are a number of rule-to-FST compilers, it is almost never
necessary in practice to write an FST by hand. Kaplan and Kay (1994) give
the mathematics that define the mapping from rules to two-level relations,
and Antworth (1990) gives details of the algorithms for rule compilation.
Mohri (1997) gives algorithms for transducer minimization and
determinization.


3.4  Lexicon-free  FSTs:  The  Porter  Stemmer 

While building a transducer from a lexicon plus rules is the standard
algorithm for morphological parsing, there are simpler algorithms that don’t
require the large on-line lexicon demanded by this algorithm. These are used
especially in Information Retrieval (IR) tasks (Chapter 17) in which a user
needs some information, and is looking for relevant documents (perhaps on
the web, perhaps in a digital library database). She gives the system a query
with some important characteristics of documents she desires, and the IR
system retrieves what it thinks are the relevant documents. One common
type of query is Boolean combinations of relevant keywords or phrases, e.g.,
(marsupial OR kangaroo OR koala). The system then returns documents that
have these words in them. Since a document with the word marsupials might
not match the keyword marsupial, some IR systems first run a stemmer on
the keywords and on the words in the document. Since morphological parsing
in IR is only used to help form equivalence classes, the details of the
suffixes are irrelevant; what matters is determining that two words have the
same stem.

One of the most widely used such stemming algorithms is the simple
and efficient Porter (1980) algorithm, which is based on a series of simple
cascaded rewrite rules. Since cascaded rewrite rules are just the sort of thing
that could be easily implemented as an FST, we think of the Porter algorithm
as a lexicon-free FST stemmer (this idea will be developed further in
Exercise 3.7). The algorithm contains rules like:

(3.6)  ATIONAL → ATE (e.g., relational → relate)

(3.7)  ING → ε if stem contains vowel (e.g., motoring → motor)

The  algorithm  is  presented  in  detail  in  Appendix  B. 
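To make the flavor of these rewrite rules concrete, here is a small Python sketch of just the two rules quoted above, (3.6) and (3.7). It is an illustration only, not the Porter algorithm: the function name `apply_rules` is hypothetical, the vowel test is simplified to the letters a, e, i, o, u, and the ordering of steps and the additional conditions of the full algorithm (given in Appendix B) are not modeled.

```python
import re

# Simplifying assumption: a "vowel" is any of a, e, i, o, u.
VOWEL = re.compile(r"[aeiou]")


def apply_rules(word):
    """Apply rule (3.6) or rule (3.7) to a lower-case word, if one matches."""
    # (3.6) ATIONAL -> ATE
    if word.endswith("ational"):
        return word[: -len("ational")] + "ate"
    # (3.7) ING -> epsilon, but only if the remaining stem contains a vowel
    if word.endswith("ing"):
        stem = word[: -len("ing")]
        if VOWEL.search(stem):
            return stem
    return word
```

With these definitions, relational maps to relate and motoring maps to motor, while a word like sing is left alone because the candidate stem s contains no vowel.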

Do  stemmers  really  improve  the  performance  of  information  retrieval 
engines?  One  problem  is  that  stemmers  are  not  perfect.  For  example  Krovetz 
(1993)  summarizes  the  following  kinds  of  errors  of  omission  and  of  commis- 
sion in  the  Porter  algorithm: 


Errors of Commission            Errors of Omission
organization    organ           European    Europe
doing           doe             analysis    analyzes
generalization  generic         matrices    matrix
numerical       numerous        noise       noisy
policy          police          sparse      sparsity
university      universe        explain     explanation
negligible      negligent       urgency     urgent

Krovetz  also  gives  the  results  of  a number  of  experiments  testing  whether 
the  Porter  stemmer  actually  improved  IR  performance.  Overall  he  found 
some  improvement,  especially  with  smaller  documents  (the  larger  the  docu- 
ment, the  higher  the  chance  the  keyword  will  occur  in  the  exact  form  used 
in  the  query).  Since  any  improvement  is  quite  small,  IR  engines  often  don’t 
use  stemming. 
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3.5  Human  Morphological  Processing 


In this section we look at psychological studies to learn how multi-morphemic
words are represented in the minds of speakers of English. For example,
consider the word walk and its inflected forms walks and walked. Are all three
in the human lexicon? Or merely walk, along with -ed and -s? How
about the word happy and its derived forms happily and happiness? We can
imagine two ends of a theoretical spectrum of representations. The full
listing hypothesis proposes that all words of a language are listed in the mental
lexicon without any internal morphological structure. On this view,
morphological structure is simply an epiphenomenon, and walk, walks, walked,
happy, and happily are all separately listed in the lexicon. This hypothesis
is certainly untenable for morphologically complex languages like Turkish
(Hankamer (1989) estimates that Turkish has 200 billion possible words). The
minimum redundancy hypothesis suggests that only the constituent
morphemes are represented in the lexicon, and when processing walks (whether
for reading, listening, or talking) we must access both morphemes (walk and
-s) and combine them.

Most modern experimental evidence suggests that neither of these is
completely true. Rather, some kinds of morphological relationships are
mentally represented (particularly inflection and certain kinds of derivation), but
others are not, with those words being fully listed. Stanners et al. (1979),
for example, found that derived forms (happiness, happily) are stored
separately from their stem (happy), but that regularly inflected forms (pouring)
are not distinct in the lexicon from their stems (pour). They did this by using
a repetition priming experiment. In short, repetition priming takes advantage
of the fact that a word is recognized faster if it has been seen before (if it is
primed). They found that lifting primed lift, and burned primed burn, but,
for example, selective didn’t prime select. Figure 3.19 sketches one possible
representation of their finding:


Figure 3.19  Stanners et al. (1979) result: Different representations of
inflection and derivation.
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In  a more  recent  study,  Marslen-Wilson  et  al.  (1994)  found  that  spoken 
derived  words  can  prime  their  stems,  but  only  if  the  meaning  of  the  derived 
form  is  closely  related  to  the  stem.  For  example  government  primes  govern, 
but  department  does  not  prime  depart.  Grainger  et  al.  (1991)  found  similar 
results  with  prefixed  words  (but  not  with  suffixed  words).  Marslen-Wilson 
et al. (1994) represent a model compatible with their own findings as follows:


Other  evidence  that  the  human  lexicon  represents  some  morphological 
structure  comes  from  speech  errors,  also  called  slips  of  the  tongue.  In 
normal  conversation,  speakers  often  mix  up  the  order  of  the  words  or  initial 
sounds: 

if  you  break  it  it'll  drop 

I don’t  have  time  to  work  to  watch  television  because  I have  to 
work 

But inflectional and derivational affixes can also appear separately from
their stems, as these examples from Fromkin and Ratner (1998) and Garrett
(1975) show:

it’s  not  only  us  who  have  screw  looses  (for  ‘screws  loose’) 
words  of  rule  formation  (for  ‘rules  of  word  formation’) 
easy enoughly (for ‘easily enough’)

which  by  itself  is  the  most  unimplausible  sentence  you  can  imagine 


The  ability  of  these  affixes  to  be  produced  separately  from  their  stem 
suggests  that  the  mental  lexicon  must  contain  some  representation  of  the 
morphological  structure  of  these  words. 

In summary, these results suggest that morphology does play a role in
the human lexicon, especially productive morphology like inflection. They
also emphasize the importance of semantic generalizations across words, and
suggest that the human auditory lexicon (representing words in terms of their
sounds) and the orthographic lexicon (representing words in terms of letters)
may have similar structures. Finally, it seems that many properties of
language processing, like morphology, may apply equally (or at least similarly)
to language comprehension and language production.


3.6  Summary 

This  chapter  introduced  morphology,  the  arena  of  language  processing  deal- 
ing with  the  subparts  of  words,  and  the  finite-state  transducer,  the  com- 
putational device  that  is  commonly  used  to  model  morphology.  Here’s  a 
summary  of  the  main  points  we  covered  about  these  ideas: 

• morphological parsing is the process of finding the constituent
morphemes in a word (e.g., cat +N +PL for cats).

• English  mainly  uses  prefixes  and  suffixes  to  express  inflectional  and 
derivational  morphology. 

• English inflectional morphology is relatively simple and includes
person and number agreement (-s) and tense markings (-ed and -ing).

• English  derivational  morphology  is  more  complex  and  includes  suf- 
fixes like  -ation,  -ness,  -able  as  well  as  prefixes  like  co-  and  re-. 

• many  constraints  on  the  English  morphotactics  (allowable  morpheme 
sequences)  can  be  represented  by  finite  automata. 

• finite-state transducers are an extension of finite-state automata that
can generate output symbols.

• two-level  morphology  is  the  application  of  finite-state  transducers  to 
morphological  representation  and  parsing. 

• spelling  rules  can  be  implemented  as  transducers. 

• there  are  automatic  transducer-compilers  that  can  produce  a transducer 
for  any  simple  rewrite  rule. 

• the  lexicon  and  spelling  rules  can  be  combined  by  composing  and  in- 
tersecting various  transducers. 

• the  Porter  algorithm  is  a simple  and  efficient  way  to  do  stemming, 
stripping  off  affixes.  It  is  not  as  accurate  as  a transducer  model  that  in- 
cludes a lexicon,  but  may  be  preferable  for  applications  like  informa- 
tion retrieval  in  which  exact  morphological  structure  is  not  needed. 
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Bibliographical  and  Historical  Notes 


Despite  the  close  mathematical  similarity  of  finite-state  transducers  to  finite- 
state  automata,  the  two  models  grew  out  of  somewhat  different  traditions. 
Chapter  2 described  how  the  finite  automaton  grew  out  of  Turing’s  (1936) 
model  of  algorithmic  computation,  and  McCulloch  and  Pitts  finite-state-like 
models  of  the  neuron.  The  influence  of  the  Turing  machine  on  the  trans- 
ducer was  somewhat  more  indirect.  Huffman  (1954)  proposed  what  was 
essentially  a state-transition  table  to  model  the  behavior  of  sequential  cir- 
cuits, based  on  the  work  of  Shannon  (1938)  on  an  algebraic  model  of  relay 
circuits.  Based  on  Turing  and  Shannon’s  work,  and  unaware  of  Huffman’s 
work,  Moore  (1956)  introduced  the  term  finite  automaton  for  a machine 
with  a finite  number  of  states  with  an  alphabet  of  input  symbols  and  an  al- 
phabet of  output  symbols.  Mealy  (1955)  extended  and  synthesized  the  work 
of  Moore  and  Huffman. 

The finite automata in Moore’s original paper and the extension by
Mealy differed in an important way. In a Mealy machine, the input/output
symbols are associated with the transitions between states. The finite-state
transducers in this chapter are Mealy machines. In a Moore machine, the
input/output symbols are associated with the state; we will see examples of
Moore machines in Chapter 5 and Chapter 7. The two types of transducers
are equivalent; any Moore machine can be converted into an equivalent
Mealy machine and vice versa.
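One direction of this equivalence is easy to sketch in Python: a Moore machine becomes a Mealy machine by moving each state's output onto every transition that enters that state. The sketch below is illustrative only. The table layouts, the function names, and the parity example are hypothetical, and it assumes the convention that a Moore machine emits the output of each state it enters while reading input (the output of the start state before any input is read is ignored).

```python
def moore_to_mealy(delta, state_output):
    """Convert a Moore machine to a Mealy machine.

    delta:        dict mapping (state, symbol) -> next state
    state_output: dict mapping state -> output symbol
    Returns a dict mapping (state, symbol) -> (next state, output symbol).
    """
    return {(q, a): (nxt, state_output[nxt]) for (q, a), nxt in delta.items()}


def run_moore(delta, state_output, start, inputs):
    """Emit the output attached to each state entered."""
    q, out = start, []
    for a in inputs:
        q = delta[(q, a)]
        out.append(state_output[q])
    return out


def run_mealy(mealy, start, inputs):
    """Emit the output attached to each transition taken."""
    q, out = start, []
    for a in inputs:
        q, o = mealy[(q, a)]
        out.append(o)
    return out


# Hypothetical example: track whether an even or odd number of 1s was read.
delta = {("even", "1"): "odd", ("odd", "1"): "even",
         ("even", "0"): "even", ("odd", "0"): "odd"}
state_output = {"even": "E", "odd": "O"}
mealy = moore_to_mealy(delta, state_output)
```

Running both machines on the same input, for instance the string 1101, produces the same output sequence, which is what equivalence means here.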

Many  early  programs  for  morphological  parsing  used  an  affix-stripping 
approach  to  parsing.  For  example  Packard’s  (1973)  parser  for  ancient  Greek 
iteratively  stripped  prefixes  and  suffixes  off  the  input  word,  making  note  of 
them,  and  then  looked  up  the  remainder  in  a lexicon.  It  returned  any  root  that 
was  compatible  with  the  stripped-off  affixes.  This  approach  is  equivalent  to 
the  bottom-up  method  of  parsing  that  we  will  discuss  in  Chapter  10. 

AMPLE (A Morphological Parser for Linguistic Exploration) (Weber
and Mann, 1981; Weber et al., 1988; Hankamer and Black, 1991) is another
early bottom-up morphological parser. It contains a lexicon with all possible
surface variants of each morpheme (these are called allomorphs), together
with constraints on their occurrence (for example in English the -es
allomorph of the plural morpheme can only occur after s, x, z, sh, or ch). The
system finds every possible sequence of morphemes which match the input
and then filters out all the sequences which have failing constraints.




An alternative approach to morphological parsing is called the
generate-and-test or analysis-by-synthesis approach. Hankamer’s (1986) keçi is a
morphological parser for Turkish which is guided by a finite-state
representation of Turkish morphemes. The program begins with a morpheme that
might match the left edge of the word, and applies every possible
phonological rule to it, checking each result against the input. If one of the outputs
succeeds, the program then follows the finite-state morphotactics to the next
morpheme and tries to continue matching the input.

The idea of modeling spelling rules as finite-state transducers is really
based on Johnson’s (1972) early idea that phonological rules (to be discussed
in Chapter 4) have finite-state properties. Johnson’s insight unfortunately did
not attract the attention of the community, and was independently discovered
by Ronald Kaplan and Martin Kay, first in an unpublished talk (Kaplan and
Kay, 1981) and then finally in print (Kaplan and Kay, 1994). Kaplan and
Kay’s work was followed up and most fully worked out by Koskenniemi
(1983), who described finite-state morphological rules for Finnish.
Karttunen (1983) built a program called KIMMO based on Koskenniemi’s
models. Antworth (1990) gives many details of two-level morphology and its
application to English. Besides Koskenniemi’s work on Finnish and that of
Antworth (1990) on English, two-level or other finite-state models of
morphology have been worked out for many languages, such as Turkish (Oflazer,
1993) and Arabic (Beesley, 1996). Antworth (1990) summarizes a number
of issues in finite-state analysis of languages with morphologically complex
processes like infixation and reduplication (for example Tagalog) and
gemination (for example Hebrew). Karttunen (1993) is a good summary of the
application of two-level morphology specifically to phonological rules of the
sort we will discuss in Chapter 4. Barton et al. (1987) bring up some
computational complexity problems with two-level models, to which
Koskenniemi and Church (1988) respond.

Students  interested  in  further  details  of  the  fundamental  mathematics 
of  automata  theory  should  see  Hopcroft  and  Ullman  (1979)  or  Lewis  and 
Papadimitriou  (1981).  Mohri  (1997)  and  Roche  and  Schabes  (1997b)  give 
additional  algorithms  and  mathematical  foundations  for  language  applica- 
tions, including  e.g.  the  details  of  the  algorithm  for  transducer  minimization. 
Sproat (1993) gives a broad general introduction to computational
morphology.




Exercises 


3.1  Add  some  adjectives  to  the  adjective  FSA  in  Figure  3.5. 

3.2  Give  examples  of  each  of  the  noun  and  verb  classes  in  Figure  3.6,  and 
find  some  exceptions  to  the  rules. 

3.3  Extend the transducer in Figure 3.14 to deal with sh and ch.

3.4  Write  a transducer(s)  for  the  K insertion  spelling  rule  in  English. 

3.5  Write  a transducer(s)  for  the  consonant  doubling  spelling  rule  in  En- 
glish. 

3.6  The Soundex algorithm (Odell and Russell, 1922; Knuth, 1973) is a
method commonly used in libraries and older Census records for
representing people’s names. It has the advantage that versions of the names that are
slightly misspelled or otherwise modified (common, for example, in
handwritten census records) will still have the same representation as
correctly-spelled names. (For example, Jurafsky, Jarofsky, Jarovsky, and Jarovski all
map to J612.)

a.  Keep  the  first  letter  of  the  name,  and  drop  all  occurrences  of  non-initial 
a,  e,  h,  i,  o,  u,  w,  y 

b.  Replace  the  remaining  letters  with  the  following  numbers: 

b, f, p, v → 1

c, g, j, k, q, s, x, z → 2

d, t → 3
l → 4
m, n → 5
r → 6

c.  Replace any sequences of identical numbers with a single number (i.e.,
666 → 6)

d.  Convert  to  the  form  Letter  Digit  Digit  Digit  by  dropping 
digits  past  the  third  (if  necessary)  or  padding  with  trailing  zeros  (if 
necessary). 

The exercise: write an FST to implement the Soundex algorithm.
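Before building the transducer, it can help to have a plain procedural version of steps a through d to check your FST against. The Python sketch below is not an FST and is not a solution to the exercise; it simply follows the four steps exactly as stated above (which differ in small ways from other published Soundex variants). The names `CODES` and `soundex` are hypothetical, and the function assumes a non-empty name.

```python
# Step b table: letter -> digit, built from the groups listed in the exercise.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Soundex code following steps a-d as given in Exercise 3.6."""
    name = name.lower()
    first, rest = name[0], name[1:]
    # Step a: keep the first letter; drop non-initial a, e, h, i, o, u, w, y.
    kept = [ch for ch in rest if ch not in "aehiouwy"]
    # Step b: replace the remaining letters with digits
    # (characters with no code, such as hyphens, are skipped).
    digits = [CODES[ch] for ch in kept if ch in CODES]
    # Step c: collapse runs of identical digits into a single digit.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Step d: truncate to three digits or pad with trailing zeros.
    return first.upper() + "".join(collapsed)[:3].ljust(3, "0")
```

All four spellings mentioned in the exercise (Jurafsky, Jarofsky, Jarovsky, Jarovski) come out as J612 under this sketch, so it can serve as a reference when testing a transducer implementation.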

3.7  Implement  one  of  the  steps  of  the  Porter  Stemmer  as  a transducer. 
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3.8  Write  the  algorithm  for  parsing  a finite-state  transducer,  using  the  pseudo- 
code introduced  in  Chapter  2.  You  should  do  this  by  modifying  the  algorithm 
nd-recognize  in  Figure  2.21  in  Chapter  2. 

3.9  Write  a program  that  takes  a word  and,  using  an  on-line  dictionary, 
computes  possible  anagrams  of  the  word. 

3.10  In Figure 3.14, why is there a z, s, x arc from q5 to q1?


COMPUTATIONAL 
PHONOLOGY  AND 
TEXT-TO-SPEECH 


You  like  po-tay-to  and  I like  po-tah-to. 

You  like  to-may-to  and  I like  to-mah-to. 

Po-tay-to,  po-tah-to, 

To-may-to,  to-mah-to, 

Let’s  call  the  whole  thing  off! 

George  and  Ira  Gershwin,  Let’s  Call  the  Whole  Thing  Off 
from  Shall  We  Dance,  1937 


The  previous  chapters  have  all  dealt  with  language  in  text  format.  We  now 
turn  to  speech.  The  next  four  chapters  will  introduce  the  fundamental  in- 
sights and  algorithms  necessary  to  understand  modern  speech  recognition 
and  speech  synthesis  technology,  and  the  related  branch  of  linguistics  called 

computational  phonology. 

Let’s begin by defining these areas. The core task of automatic speech
recognition is to take an acoustic waveform as input and produce as output
a string of words. The core task of text-to-speech synthesis is to take a
sequence of text words and produce as output an acoustic waveform. The
uses of speech recognition and synthesis are manifold, including automatic
dictation/transcription, speech-based interfaces to computers and telephones,
voice-based input and output for the disabled, and many others that will be
discussed in greater detail in Chapter 7.

This chapter will focus on an important part of both speech recognition
and text-to-speech systems: how words are pronounced in terms of
individual speech units called phones. A speech recognition system needs to have
a pronunciation for every word it can recognize, and a text-to-speech system
needs to have a pronunciation for every word it can say. The first section of
this chapter will introduce phonetic alphabets for describing pronunciation,
part of the field of phonetics. We then introduce articulatory phonetics, the
study of how speech sounds are produced by articulators in the mouth.

Modeling  pronunciation  would  be  much  simpler  if  a given  phone  was 
always  pronounced  the  same  in  every  context.  Unfortunately  this  is  not  the 
case.  As  we  will  see,  the  phone  [t]  is  pronounced  very  differently  in  different 
phonetic  environments.  Phonology  is  the  area  of  linguistics  that  describes 
the  systematic  way  that  sounds  are  differently  realized  in  different  environ- 
ments, and  how  this  system  of  sounds  is  related  to  the  rest  of  the  grammar. 
The  next  section  of  the  chapter  will  describe  the  way  we  write  phonological 
rules  to  describe  these  different  realizations. 

We next introduce an area known as computational phonology. One
important part of computational phonology is the study of computational
mechanisms for modeling phonological rules. We will show how the
spelling-rule transducers of Chapter 3 can be used to model phonology. We then
discuss computational models of phonological learning: how phonological
rules can be automatically induced by machine learning algorithms.

Finally,  we  apply  the  transducer-based  model  of  phonology  to  an  im- 
portant problem  in  text-to-speech  systems:  mapping  from  strings  of  letters 
to  strings  of  phones.  We  first  survey  the  issues  involved  in  building  a large 
pronunciation  dictionary,  and  then  show  how  the  transducer-based  lexicons 
and  spelling  rules  of  Chapter  3 can  be  augmented  with  pronunciations  to 
map  from  orthography  to  pronunciation. 

This chapter focuses on the non-probabilistic areas of computational
linguistics and pronunciation modeling. Chapter 5 will turn to the role of
probabilistic models, including such areas as probabilistic models of
pronunciation variation and probabilistic methods for learning phonological rules.


4.1  Speech Sounds and Phonetic Transcription

The study of the pronunciation of words is part of the field of phonetics, the
study of the speech sounds used in the languages of the world. We will be
modeling the pronunciation of a word as a string of symbols which represent
phones or segments. A phone is a speech sound; we will represent phones
with phonetic symbols that bear some resemblance to letters in an
alphabetic language like English. So for example there is a phone represented by l
that usually corresponds to the letter l and a phone represented by p that
usually corresponds to the letter p. Actually, as we will see later, phones have
much more variation than letters do. This chapter will only briefly touch
on other aspects of phonetics such as prosody, which includes things like
changes in pitch and duration.


IPA      ARPAbet              IPA              ARPAbet
Symbol   Symbol   Word        Transcription    Transcription
[p]      [p]      parsley     ['pɑrsli]        [p aa r s l iy]
[t]      [t]      tarragon    [tærəgɑn]        [t ae r ax g aa n]
[k]      [k]      catnip      ['kætnɨp]        [k ae t n ix p]
[b]      [b]      bay         [beɪ]            [b ey]
[d]      [d]      dill        [dɪl]            [d ih l]
[g]      [g]      garlic      [gɑrlɨk]         [g aa r l ix k]
[m]      [m]      mint        [mɪnt]           [m ih n t]
[n]      [n]      nutmeg      ['nʌtmɛg]        [n ah t m eh g]
[ŋ]      [ng]     ginseng     ['dʒɪnsɨŋ]       [jh ih n s ix ng]
[f]      [f]      fennel      [fɛnl̩]           [f eh n el]
[v]      [v]      clove       [kloʊv]          [k l ow v]
[θ]      [th]     thistle     ['θɪsl̩]          [th ih s el]
[ð]      [dh]     heather     ['hɛðɚ]          [h eh dh axr]
[s]      [s]      sage        [seɪdʒ]          [s ey jh]
[z]      [z]      hazelnut    [heɪzl̩nʌt]       [h ey z el n ah t]
[ʃ]      [sh]     squash      [skwɑʃ]          [s k w aa sh]
[ʒ]      [zh]     ambrosia    [æm'broʊʒə]      [ae m b r ow zh ax]
[tʃ]     [ch]     chicory     ['tʃɪkɚi]        [ch ih k axr iy]
[dʒ]     [jh]     sage        [seɪdʒ]          [s ey jh]
[l]      [l]      licorice    [lɪkɚɨʃ]         [l ih k axr ix sh]
[w]      [w]      kiwi        ['kiwi]          [k iy w iy]
[r]      [r]      parsley     ['pɑrsli]        [p aa r s l iy]
[j]      [y]      yew         [yu]             [y uw]
[h]      [h]      horseradish [hɔrsrædɪʃ]      [h ao r s r ae d ih sh]
[ʔ]      [q]      uh-oh       [ʔʌʔoʊ]          [q ah q ow]
[ɾ]      [dx]     butter      ['bʌɾɚ]          [b ah dx axr]
[ɾ̃]      [nx]     wintergreen [wɪɾ̃ɚgrin]       [w ih nx axr g r iy n]
[l̩]      [el]     thistle     ['θɪsl̩]          [th ih s el]

Figure 4.1  IPA and ARPAbet symbols for transcription of English
consonants.

This section surveys the different phones of English, particularly
American English, showing how they are produced and how they are represented
symbolically. We will be using two different alphabets for describing phones.
The first is the International Phonetic Alphabet (IPA). The IPA is an
evolving standard originally developed by the International Phonetic Association
in 1888 with the goal of transcribing the sounds of all human languages.
The IPA is not just an alphabet but also a set of principles for transcription,
which differ according to the needs of the transcription, so the same
utterance can be transcribed in different ways all according to the principles of
the IPA. In the interests of brevity in this book we will focus on the symbols
that are most relevant for English; thus Figure 4.1 shows a subset of the IPA
symbols for transcribing consonants, while Figure 4.2 shows a subset of the
IPA symbols for transcribing vowels.1 These tables also give the ARPAbet
symbols; ARPAbet is another phonetic alphabet, but one that is
specifically designed for American English and which uses ASCII symbols; it can
be thought of as a convenient ASCII representation of an American-English
subset of the IPA. ARPAbet symbols are often used in applications where
non-ASCII fonts are inconvenient, such as in on-line pronunciation
dictionaries.
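Because ARPAbet is essentially an ASCII encoding of a subset of the IPA, converting between the two is a simple table lookup. The Python sketch below illustrates this with a partial table drawn from the symbols in Figures 4.1 and 4.2. The names `ARPABET_TO_IPA` and `arpabet_to_ipa` are hypothetical, the table is deliberately incomplete, stress marks are not handled, and [r] follows this book's simplification of the American English ‘r’ sound.

```python
# Partial ARPAbet -> IPA table (lower-case symbols, as in the figures).
ARPABET_TO_IPA = {
    # consonants
    "p": "p", "t": "t", "k": "k", "b": "b", "d": "d", "g": "g",
    "m": "m", "n": "n", "ng": "ŋ", "f": "f", "v": "v",
    "th": "θ", "dh": "ð", "s": "s", "z": "z", "sh": "ʃ", "zh": "ʒ",
    "ch": "tʃ", "jh": "dʒ", "l": "l", "w": "w", "r": "r", "y": "j", "h": "h",
    # vowels
    "iy": "i", "ih": "ɪ", "ey": "eɪ", "eh": "ɛ", "ae": "æ", "aa": "ɑ",
    "ow": "oʊ", "uw": "u", "ah": "ʌ", "ax": "ə",
}


def arpabet_to_ipa(transcription):
    """Convert a space-separated ARPAbet string to an IPA string.

    Raises KeyError for symbols missing from the partial table above.
    """
    return "".join(ARPABET_TO_IPA[sym] for sym in transcription.split())
```

For example, the ARPAbet transcription p aa r s l iy (parsley) becomes pɑrsli, and s ey jh (sage) becomes seɪdʒ. An on-line pronunciation dictionary can store the ASCII form and convert to IPA only for display.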

Many of the IPA and ARPAbet symbols are equivalent to the Roman
letters used in the orthography of English and many other languages. So for
example the IPA and ARPAbet symbol [p] represents the consonant sound at
the beginning of platypus, puma, and pachyderm, the middle of leopard, or
the end of antelope (note that the final orthographic e of antelope does not
correspond to any final vowel; the p is the last sound).

The mapping between the letters of English orthography and IPA
symbols is rarely as simple as this, however. This is because the mapping
between English orthography and pronunciation is quite opaque; a single letter
can represent very different sounds in different contexts. Figure 4.3 shows
that the English letter c is represented as IPA [k] in the word cougar, but IPA
[s] in the word civet. Besides appearing as c and k, the sound marked as [k]
in the IPA can appear as part of x (fox), as ck (jackal), and as cc (raccoon).
Many other languages, for example Spanish, are much more transparent in
their sound-orthography mapping than English.

The  Vocal  Organs 

We  turn  now  to  articulatory  phonetics,  the  study  of  how  phones  are  pro- 
duced, as  the  various  organs  in  the  mouth,  throat,  and  nose  modify  the  airflow 
from  the  lungs. 


1 For simplicity we use the symbol [r] for the American English ‘r’ sound, rather than the
more standard IPA symbol [ɹ].

IPA      ARPAbet              IPA              ARPAbet
Symbol   Symbol   Word        Transcription    Transcription
[i]      [iy]     lily        ['lɪli]          [l ih l iy]
[ɪ]      [ih]     lily        ['lɪli]          [l ih l iy]
[eɪ]     [ey]     daisy       ['deɪzi]         [d ey z iy]
[ɛ]      [eh]     poinsettia  [poɪn'sɛɾiə]     [p oy n s eh dx iy ax]
[æ]      [ae]     aster       ['æstɚ]          [ae s t axr]
[ɑ]      [aa]     poppy       [pɑpi]           [p aa p iy]
[ɔ]      [ao]     orchid      [ɔrkɨd]          [ao r k ix d]
[ʊ]      [uh]     woodruff    ['wʊdrʌf]        [w uh d r ah f]
[oʊ]     [ow]     lotus       ['loʊɾəs]        [l ow dx ax s]
[u]      [uw]     tulip       ['tulɨp]         [t uw l ix p]
[ʌ]      [ah]     buttercup   ['bʌɾɚkʌp]       [b ah dx axr k ah p]
[ɝ]      [er]     bird        ['bɝd]           [b er d]
[aɪ]     [ay]     iris        ['aɪrɨs]         [ay r ix s]
[aʊ]     [aw]     sunflower   ['sʌnflaʊɚ]      [s ah n f l aw axr]
[oɪ]     [oy]     poinsettia  [poɪn'sɛɾiə]     [p oy n s eh dx iy ax]
[ju]     [y uw]   feverfew    [fivɚfju]        [f iy v axr f y uw]
[ə]      [ax]     woodruff    ['wʊdrəf]        [w uh d r ax f]
[ɚ]      [axr]    heather     ['hɛðɚ]          [h eh dh axr]
[ɨ]      [ix]     tulip       ['tulɨp]         [t uw l ix p]
[ʉ]      [ux]                 []               []

Figure 4.2  IPA and ARPAbet symbols for transcription of English vowels

Word      jackal         raccoon         cougar         civet
IPA       ['dʒæ.kl̩]      [ræ.'kun]       ['ku.gɚ]       ['sɪ.vɨt]
ARPAbet   [jh ae k el]   [r ae k uw n]   [k uw g axr]   [s ih v ix t]

Figure 4.3  The mapping between IPA symbols and letters in English
orthography is complicated; both IPA [k] and English orthographic [c] have
many alternative realizations


Sound is produced by the rapid movement of air. Most sounds in
human languages are produced by expelling air from the lungs through the
windpipe (technically the trachea) and then out the mouth or nose. As it
passes through the trachea, the air passes through the larynx, commonly
known as the Adam’s apple or voicebox. The larynx contains two small
folds of muscle, the vocal folds (often referred to non-technically as the
vocal cords) which can be moved together or apart. The space between these
two folds is called the glottis. If the folds are close together (but not tightly
closed), they will vibrate as air passes through them; if they are far apart,
they won’t vibrate. Sounds made with the vocal folds together and vibrating
are called voiced; sounds made without this vocal cord vibration are called
unvoiced or voiceless. Voiced sounds include [b], [d], [g], [v], [z], and all
the English vowels, among others. Unvoiced sounds include [p], [t], [k], [f],
[s], and others.

The area above the trachea is called the vocal tract, and consists of the
oral tract and the nasal tract. After the air leaves the trachea, it can exit the
body through the mouth or the nose. Most sounds are made by air passing
through the mouth. Sounds made by air passing through the nose are called
nasal sounds; nasal sounds use both the oral and nasal tracts as resonating
cavities; English nasal sounds include [m], [n], and [ng].

Phones  are  divided  into  two  main  classes:  consonants  and  vowels,  nants 
Both  kinds  of  sounds  arc  formed  by  the  motion  of  air  through  the  mouth,  vowels 
throat  or  nose.  Consonants  arc  made  by  restricting  or  blocking  the  airflow  in 
some  way,  and  may  be  voiced  or  unvoiced.  Vowels  have  less  obstruction,  arc 
usually  voiced,  and  arc  generally  louder  and  longer-lasting  than  consonants. 

The technical use of these terms is much like the common usage: [p], [b],
[t], [d], [k], [g], [f], [v], [s], [z], [r], [l], etc., are consonants; [aa], [ae], [aw],
[ao], [ih], [ow], [uw], etc., are vowels. Semivowels (such as [y] and
[w]) have some of the properties of both; they are voiced like vowels, but
they are short and less syllabic like consonants.

Consonants:  Place  of  Articulation 

Because consonants are made by restricting the airflow in some way,
consonants can be distinguished by where this restriction is made: the point
of maximum restriction is called the place of articulation of a consonant.
Places of articulation, shown in Figure 4.5, are often used in automatic
speech recognition as a useful way of grouping phones together into
equivalence classes:


• labial: Consonants whose main restriction is formed by the two lips
coming together have a bilabial place of articulation. In English these
include [p] as in possum, [b] as in bear, and [m] as in marmot. The
English labiodental consonants [v] and [f] are made by pressing the
bottom lip against the upper row of teeth and letting the air flow through
the space in the upper teeth.

• dental: Sounds that are made by placing the tongue against the teeth
are dentals. The main dentals in English are the [θ] of thing and the [ð]
of though, which are made by placing the tongue behind the teeth with
the tip slightly between the teeth.

• alveolar:  The  alveolar  ridge  is  the  portion  of  the  roof  of  the  mouth  just 
behind  the  upper  teeth.  Most  speakers  of  American  English  make  the 
phones  [s],  [z],  [t],  and  [d]  by  placing  the  tip  of  the  tongue  against  the 
alveolar  ridge. 

• palatal: The roof of the mouth (the palate) rises sharply from the
back of the alveolar ridge. The palato-alveolar sounds [ʃ] (shrimp),
[tʃ] (chinchilla), [ʒ] (Asian), and [dʒ] (jaguar) are made with the blade
of the tongue against this rising back of the alveolar ridge. The palatal
sound [y] of yak is made by placing the front of the tongue up close to
the palate.

• velar: The velum or soft palate is a movable muscular flap at the very
back of the roof of the mouth. The sounds [k] (cuckoo), [g] (goose),
and [ŋ] (kingfisher) are made by pressing the back of the tongue up
against the velum.

• glottal: The glottal stop [ʔ] is made by closing the glottis (by bringing
the vocal folds together).

Consonants:  Manner  of  Articulation 

Consonants are also distinguished by how the restriction in airflow is made,
for example whether there is a complete stoppage of air, or only a partial
blockage, etc. This feature is called the manner of articulation of a
consonant. The combination of place and manner of articulation is usually
sufficient to uniquely identify a consonant. Here are the major manners of
articulation for English consonants:

• stop: A stop is a consonant in which airflow is completely blocked
for a short time. This blockage is followed by an explosive sound as
the air is released. The period of blockage is called the closure and
the explosion is called the release. English has voiced stops like [b],
[d], and [g] as well as unvoiced stops like [p], [t], and [k]. Stops are
also called plosives. It is possible to use a more narrow (detailed)
transcription style to distinctly represent the closure and release parts of
a stop, both in ARPAbet and IPA-style transcriptions. For example
the closure of a [p], [t], or [k] would be represented as [pcl], [tcl], or
[kcl] (respectively) in the ARPAbet, and [p̚], [t̚], or [k̚] (respectively)
in IPA style. When this form of narrow transcription is used, the
unmarked ARPAbet symbols [p], [t], and [k] indicate purely the release
of the consonant. We will not be using this narrow transcription style
in this chapter.

• nasals: The nasal sounds [n], [m], and [ŋ] are made by lowering the
velum and allowing air to pass into the nasal cavity.

• fricative: In fricatives, airflow is constricted but not cut off completely.
The turbulent airflow that results from the constriction produces a
characteristic ‘hissing’ sound. The English labiodental fricatives [f] and [v]
are produced by pressing the lower lip against the upper teeth,
allowing a restricted airflow between the upper teeth. The dental fricatives
[θ] and [ð] allow air to flow around the tongue between the teeth. The
alveolar fricatives [s] and [z] are produced with the tongue against the
alveolar ridge, forcing air over the edge of the teeth. In the
palato-alveolar fricatives [ʃ] and [ʒ] the tongue is at the back of the alveolar
ridge, forcing air through a groove formed in the tongue. The
higher-pitched fricatives (in English [s], [z], [ʃ], and [ʒ]) are called sibilants.
Stops that are followed immediately by fricatives are called affricates;
these include English [tʃ] (chicken) and [dʒ] (giraffe).

• approximant: In approximants, the two articulators are close together
but not close enough to cause turbulent airflow. In English [y] (yellow),
the tongue moves close to the roof of the mouth but not close enough
to cause the turbulence that would characterize a fricative. In English
[w] (wormwood), the back of the tongue comes close to the velum.
American [r] can be formed in at least two ways: with just the tip of
the tongue extended and close to the palate, or with the whole tongue
bunched up near the palate. [l] is formed with the tip of the tongue up
against the alveolar ridge or the teeth, with one or both sides of the
tongue lowered to allow air to flow over it. [l] is called a lateral sound
because of the drop in the sides of the tongue.

• tap: A tap or flap [ɾ] is a quick motion of the tongue against the
alveolar ridge. The consonant in the middle of the word lotus ([loʊɾəs]) is
a tap in most dialects of American English; speakers of many British
dialects would use a [t] instead of a tap in this word.
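
The place and manner classes above, together with voicing, can be stored as a small feature table. The sketch below is only an illustration: the ASCII phone names (th, dh, sh, zh, ng) and the particular selection of consonants are our own assumptions, not a complete inventory. It groups phones into equivalence classes by place, in the spirit of the grouping used in speech recognition.

```python
# Illustrative feature table: phone -> (place, manner, voiced).
# ASCII names are stand-ins chosen for this sketch (th = [θ], dh = [ð],
# sh = [ʃ], zh = [ʒ], ng = [ŋ]); the list is deliberately incomplete.
CONSONANTS = {
    "p": ("bilabial", "stop", False),
    "b": ("bilabial", "stop", True),
    "m": ("bilabial", "nasal", True),
    "f": ("labiodental", "fricative", False),
    "v": ("labiodental", "fricative", True),
    "th": ("dental", "fricative", False),
    "dh": ("dental", "fricative", True),
    "t": ("alveolar", "stop", False),
    "d": ("alveolar", "stop", True),
    "n": ("alveolar", "nasal", True),
    "s": ("alveolar", "fricative", False),
    "z": ("alveolar", "fricative", True),
    "sh": ("palato-alveolar", "fricative", False),
    "zh": ("palato-alveolar", "fricative", True),
    "k": ("velar", "stop", False),
    "g": ("velar", "stop", True),
    "ng": ("velar", "nasal", True),
}


def group_by(feature_index):
    """Group phones into classes by one feature (0 = place, 1 = manner)."""
    classes = {}
    for phone, features in CONSONANTS.items():
        classes.setdefault(features[feature_index], []).append(phone)
    return classes


by_place = group_by(0)
by_manner = group_by(1)
```

In this toy table no two phones share the same place, manner, and voicing, so the three features together pick out each consonant.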

Vowels 

Like consonants, vowels can be characterized by the position of the
articulators as they are made. The two most relevant parameters for vowels are
what is called vowel height, which correlates roughly with the location of
the highest part of the tongue, and the shape of the lips (rounded or not).
Figure 4.6 shows the position of the tongue for different vowels.


In the vowel [i], for example, the highest point of the tongue is toward
the front of the mouth. In the vowel [u], by contrast, the high point of the
tongue is located toward the back of the mouth. Vowels in which the tongue
is raised toward the front are called front vowels; those in which the tongue
is raised toward the back are called back vowels. Note that while both [i]
and [e] are front vowels, the tongue is higher for [i] than for [e]. Vowels in
which the highest point of the tongue is comparatively high are called high
vowels; vowels with mid or low values of maximum tongue height are called
mid vowels or low vowels, respectively.

Figure 4.7 shows a schematic characterization of the vowel height of
different vowels. It is schematic because the abstract property height only
correlates roughly with actual tongue positions; it is in fact a more accurate
reflection of acoustic facts. Note that the chart has two kinds of vowels:
those in which tongue height is represented as a point and those in which it
is represented as a vector. A vowel in which the tongue position changes
markedly during the production of the vowel is a diphthong. English is
particularly rich in diphthongs; many are written with two symbols in the IPA
(for example the [ei] of hake or the [ou] of cobra).

Figure 4.7 Qualities of English vowels (after Ladefoged (1993)).

The  second  important  articulatory  dimension  for  vowels  is  the  shape 
of  the  lips.  Certain  vowels  are  pronounced  with  the  lips  rounded  (the  same 
lip  shape  used  for  whistling).  These  rounded  vowels  include  [u],  [o],  and  the 
diphthong  [ou]. 

Syllables 

Consonants and vowels combine to make a syllable. There is no completely
agreed-upon definition of a syllable; roughly speaking a syllable is a
vowel-like sound together with some of the surrounding consonants that are most
closely associated with it. The IPA period symbol [.] is used to separate
syllables, so parsley and catnip have two syllables ([ˈpar.sli] and [ˈkæt.nɪp]
respectively), tarragon has three ([ˈtæ.rə.gan]), and dill has one ([dɪl]). A
syllable is usually described as having an optional initial consonant or set of
consonants called the onset, followed by a vowel or vowels, followed by a
final consonant or sequence of consonants called the coda. Thus [d] is the
onset of [dɪl], while [l] is the coda. The task of breaking up a word into
syllables is called syllabification. Although automatic syllabification algorithms
exist, the problem is hard, partly because there is no agreed-upon definition
of syllable boundaries. Furthermore, although it is usually clear how many
syllables are in a word, Ladefoged (1993) points out there are some words
(meal, teal, seal, hire, fire, hour) that can be viewed either as having one
syllable or two.

In a natural sentence of American English, certain syllables are more
prominent than others. These are called accented syllables. Accented
syllables may be prominent because they are louder, they are longer, they are
associated with a pitch movement, or any combination of the above. Since
accent plays important roles in meaning, understanding exactly why a speaker
chooses to accent a particular syllable is very complex. But one important
factor in accent is often represented in pronunciation dictionaries. This
factor is called lexical stress. The syllable that has lexical stress is the one that
will be louder or longer if the word is accented. For example the word
parsley is stressed in its first syllable, not its second. Thus if the word parsley
is accented in a sentence, it is the first syllable that will be stronger. We
write the symbol [ˈ] before a syllable to indicate that it has lexical stress (e.g.
[ˈpar.sli]). This difference in lexical stress can affect the meaning of a word.
For example the word content can be a noun or an adjective. When
pronounced in isolation the two senses are pronounced differently since they
have different stressed syllables (the noun is pronounced [ˈkan.tɛnt] and the
adjective [kan.ˈtɛnt]). Other pairs like this include object (noun [ˈab.dʒɛkt]
and verb [ab.ˈdʒɛkt]); see Cutler (1986) for more examples. Automatic
disambiguation of such homographs is discussed in Chapter 17. The role of
prosody is taken up again in Section 4.7.
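
The onset/vowel/coda description of the syllable and the stress mark can be made concrete with a small parser. This is a minimal sketch under our own assumptions: transcriptions are space-separated ARPAbet-like symbols, a period separates syllables, an apostrophe token marks lexical stress, and the vowel set is a toy subset. It does not attempt syllabification itself, which the text notes is a hard problem; the syllable breaks must already be given.

```python
# Toy vowel inventory (an assumption of this sketch, not a full ARPAbet set).
VOWELS = {"aa", "ae", "ah", "ax", "eh", "ih", "iy", "ow", "uw"}


def split_syllable(phones):
    """Split one syllable into (onset, nucleus, coda) around its vowel(s)."""
    positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not positions:
        raise ValueError("syllable has no vowel")
    first, last = positions[0], positions[-1]
    return phones[:first], phones[first:last + 1], phones[last + 1:]


def parse_word(transcription):
    """Parse e.g. "' k aa n . t eh n t" into a list of syllable records."""
    syllables = []
    for chunk in transcription.split("."):
        phones = chunk.split()
        stressed = bool(phones) and phones[0] == "'"
        if stressed:
            phones = phones[1:]  # drop the stress mark itself
        onset, nucleus, coda = split_syllable(phones)
        syllables.append(
            {"stressed": stressed, "onset": onset, "nucleus": nucleus, "coda": coda}
        )
    return syllables


dill = parse_word("d ih l")
content_noun = parse_word("' k aa n . t eh n t")
content_adjective = parse_word("k ax n . ' t eh n t")
```

The two parses of content differ only in which syllable carries the stress flag, which is the information a pronunciation dictionary would need to record to distinguish the noun from the adjective.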


4.2  The  Phoneme  and  Phonological  Rules 


’Scuse me, while I kiss the sky

Jimi Hendrix, Purple Haze

’Scuse me, while I kiss this guy

Common mis-hearing of same lyrics

All [t]s are not created equally. That is, phones are often produced
differently in different contexts. For example, consider the different
pronunciations of [t] in the words tunafish and starfish. The [t] of tunafish is
aspirated. Aspiration is a period of voicelessness after a stop closure and
before the onset of voicing of the following vowel. Since the vocal cords are
not vibrating, aspiration sounds like a puff of air after the [t] and before the
vowel. By contrast, a [t] following an initial [s] is unaspirated; thus the [t]
in starfish ([starfɪʃ]) has no period of voicelessness after the [t] closure. This
variation in the realization of [t] is predictable: whenever a [t] begins a word
or unreduced syllable in English, it is aspirated. The same variation occurs
for [k]; the [k] of sky is often mis-heard as [g] in Jimi Hendrix’s lyrics because
[k] and [g] are both unaspirated. In a very detailed transcription system we
could use the symbol for aspiration [ʰ] after any [t] (or [k] or [p]) which
begins a word or unreduced syllable. The word tunafish would be transcribed
[tʰunəfɪʃ] (the ARPAbet does not have a way of marking aspiration).

There are other contextual variants of [t]. For example, when [t] occurs
between two vowels, particularly when the first is stressed, it is pronounced
as a tap. Recall that a tap is a voiced sound in which the top of the tongue
is curled up and back and struck quickly against the alveolar ridge. Thus the
word buttercup is usually pronounced [bʌɾɚkʌp]/[b uh dx axr k uh p] rather
than [bʌtɚkʌp]/[b uh t axr k uh p].

Another variant of [t] occurs before the dental consonant [θ]. Here the
[t] becomes dentalized ([t̪]). That is, instead of the tongue forming a closure
against the alveolar ridge, the tongue touches the back of the teeth.

How do we represent this relation between a [t] and its different
realizations in different contexts? We generally capture this kind of
pronunciation variation by positing an abstract class called the phoneme, which is
realized as different allophones in different contexts. We traditionally write
phonemes inside slashes. So in the above examples, /t/ is a phoneme whose
allophones include [tʰ], [ɾ], and [t̪]. A phoneme is thus a kind of
generalization or abstraction over different phonetic realizations. Often we equate
the phonemic and the lexical levels, thinking of the lexicon as containing
transcriptions expressed in terms of phonemes. When we are transcribing
the pronunciations of words we can choose to represent them at this broad
phonemic level; such a broad transcription leaves out a lot of predictable
phonetic detail. We can also choose to use a narrow transcription that
includes more detail, including allophonic variation, and uses the various
diacritics. Figure 4.8 summarizes a number of allophones of /t/; Figure 4.9
shows a few of the most commonly used IPA diacritics.

The relationship between a phoneme and its allophones is often
captured by writing a phonological rule. Here is the phonological rule for
dentalization in the traditional notation of Chomsky and Halle (1968):

/t/ → [t̪] / ___ θ    (4.1)

In this notation, the surface allophone appears to the right of the arrow,
and the phonetic environment is indicated by the symbols surrounding the
underbar (___). These rules resemble the rules of two-level morphology of
Chapter 3, but since they don’t use multiple types of rewrite arrows, this rule
is ambiguous between an obligatory or optional rule.

Phone  Environment                              Example    IPA
[tʰ]   in initial position                      toucan     [tʰukʰæn]
[t]    after [s] or in reduced syllables        starfish   [starfɪʃ]
[ʔ]    word-finally or after vowel before [n]   kitten     [kʰɪʔn]
[ʔt]   sometimes word-finally                   cat        [kʰæʔt]
[ɾ]    between vowels                           buttercup  [bʌɾɚkʰʌp]
[t̚]    before consonants or word-finally        fruitcake  [frut̚kʰeɪk]
[t̪]    before dental consonants ([θ])           eighth     [eɪt̪θ]
∅      sometimes word-finally                   past       [pæs]

Figure 4.8 Some allophones of /t/ in General American English

Here is a version of the flapping rule:


Diacritics
◌̥   Voiceless    [ḁ]
ʰ   Aspirated    [pʰ]
◌̩   Syllabic     [l̩]
◌̃   Nasalized    [ã]
◌̚   Unreleased   [t̚]
◌̪   Dental       [t̪]

Suprasegmentals
ˈ   Primary stress     [ˈpu.mə]
ˌ   Secondary stress   [ˈfoʊɾəˌgræf]
ː   Long               [aː]
ˑ   Half long          [aˑ]
.   Syllable break     [ˈpu.mə]

Figure 4.9 Some of the IPA diacritics and symbols for suprasegmentals.


4.3  Phonological  Rules  and  Transducers 

Chapter 3 showed that spelling rules can be implemented by transducers.
Phonological rules can be implemented as transducers in the same way;
indeed the original work by Johnson (1972) and Kaplan and Kay (1981)
on finite-state models was based on phonological rules rather than spelling
rules. There are a number of different models of computational
phonology that use finite automata in various ways to realize phonological rules.
We will describe the two-level morphology of Koskenniemi (1983) used in
Chapter 3, but the interested reader should be aware of other recent models.²
While Chapter 3 gave examples of two-level rules, it did not talk about the
motivation for these rules, and the differences between traditional ordered
rules and two-level rules. We will begin with this comparison.

As a first example, Figure 4.10 shows a transducer which models the
application of the simplified flapping rule in (4.3):

/t/ → [ɾ] / V́ ___ V    (4.3)




Figure  4.10  Transducer  for  English  Flapping:  ARPAbet  ‘dx’  indicates  a 
flap,  and  the  ‘other’  symbol  means  ‘any  feasible  pair  not  used  elsewhere  in  the 
transducer’.  ‘@’  means  ‘any  symbol  not  used  elsewhere  on  any  arc’. 


The transducer in Figure 4.10 accepts any string in which flaps occur
in the correct places (after a stressed vowel, before an unstressed vowel), and
rejects strings in which flapping doesn’t occur, or in which flapping occurs
in the wrong environment. Of course the factors that influence flapping are
actually a good deal more complicated, as we will see in Section 5.7.
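
To make the behavior of such a machine concrete, here is a minimal sketch of a three-state sequential transducer for the simplified flapping rule. It is not a transcription of the automaton in Figure 4.10: the state names are ours, and we assume a CMU-dictionary-style convention (not used in the text) in which a vowel symbol ends in 1 if stressed and 0 if unstressed. The machine holds a /t/ that follows a stressed vowel until it has seen the next symbol, and only then decides between [dx] and [t].

```python
def is_stressed_vowel(phone):
    return phone.endswith("1")   # assumed convention: "uh1" is a stressed vowel


def is_unstressed_vowel(phone):
    return phone.endswith("0")   # assumed convention: "axr0" is unstressed


def flap_transduce(phones):
    """Map lexical phones to surface phones, flapping /t/ between V1 and V0."""
    out = []
    state = "start"              # one of: start, after_stressed, pending_t
    for phone in phones:
        if state == "pending_t":
            # Resolve the held /t/ now that the following symbol is known.
            out.append("dx" if is_unstressed_vowel(phone) else "t")
            state = "start"
        if phone == "t" and state == "after_stressed":
            state = "pending_t"  # hold the /t/; emit nothing yet
            continue
        out.append(phone)
        state = "after_stressed" if is_stressed_vowel(phone) else "start"
    if state == "pending_t":
        out.append("t")          # word-final /t/ is not flapped by this rule
    return out


def accepts(lexical, surface):
    """Accept a lexical/surface pair only if flaps occur exactly where required."""
    return flap_transduce(lexical) == surface


buttercup = "b uh1 t axr0 k uh1 p".split()
flapped = flap_transduce(buttercup)
```

The accepts function mirrors the description above: a surface string with a missing flap, or with a flap in the wrong environment, is rejected.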

In a traditional phonological system, many different phonological rules
apply between the lexical form and the surface form. Sometimes these rules
interact; the output from one rule affects the input to another rule. One way
to implement rule-interaction in a transducer system is to run transducers in
a cascade. Consider, for example, the rules that are needed to deal with the
phonological behavior of the English noun plural suffix -s. This suffix is
pronounced [iz] after the phones [s], [ʃ], [z], or [ʒ] (so peaches is pronounced
[pitʃiz], and faxes is pronounced [fæksiz]), [z] after voiced sounds (pigs is
pronounced [pɪgz]), and [s] after unvoiced sounds (cats is pronounced [kæts]).
We model this variation by writing phonological rules for the realization of
the morpheme in different contexts. We first need to choose one of these
three forms (s, z, and iz) as the ‘lexical’ pronunciation of the suffix; we
chose z only because it turns out to simplify rule writing. Next we write two
phonological rules. One, similar to the E-insertion spelling rule of page 77,
inserts an [i] after a morpheme-final sibilant and before the plural morpheme
[z]. The other makes sure that the -s suffix is properly realized as [s] after
unvoiced consonants.

2 For example Bird and Ellison’s (1994) model of the multi-tier representations of
autosegmental phonology in which each phonological tier is represented by a finite-state automaton,
and autosegmental association by the synchronization of two automata.

ε → i / [+sibilant] ^ ___ z #    (4.4)

z → s / [-voice] ^ ___ #    (4.5)

These two rules must be ordered; rule (4.4) must apply before (4.5).
This is because the environment of (4.4) includes z, and the rule (4.5) changes
z. Consider running both rules on the lexical form fox concatenated with the
plural -s:

Lexical form:          faks^z
(4.4) applies:         faks^iz
(4.5) doesn’t apply:   faks^iz

If the devoicing rule (4.5) were ordered first, we would get the wrong
result (what would this incorrect result be?). This situation, in which one
rule destroys the environment for another, is called bleeding:³

Lexical form:          faks^z
(4.5) applies:         faks^s
(4.4) doesn’t apply:   faks^s
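
The effect of ordering can be reproduced with two string-rewriting functions applied in sequence. This is a sketch under stated assumptions rather than a transducer implementation: the rules are written as regular-expression substitutions, forms are plain ASCII strings with ^ for the morpheme boundary and # for the word boundary, and the sibilant and voiceless classes are small illustrative sets (with S and Z standing in for [ʃ] and [ʒ]).

```python
import re

SIBILANT = "szSZ"       # illustrative class; S = [ʃ], Z = [ʒ]
VOICELESS = "ptkfsS"    # illustrative class of voiceless consonants


def i_insertion(form):
    """Rule (4.4): insert i between a morpheme-final sibilant and plural z."""
    return re.sub(rf"([{SIBILANT}]\^)(z#)", r"\1i\2", form)


def z_devoicing(form):
    """Rule (4.5): realize plural z as s after a morpheme-final voiceless consonant."""
    return re.sub(rf"([{VOICELESS}]\^)z#", r"\1s#", form)


def apply_in_order(form, rules):
    """Run the rules as a cascade: each rule sees the previous rule's output."""
    for rule in rules:
        form = rule(form)
    return form


correct = apply_in_order("faks^z#", [i_insertion, z_devoicing])
bled = apply_in_order("faks^z#", [z_devoicing, i_insertion])
```

With insertion first, devoicing no longer finds z directly after the boundary, so the z survives. With devoicing first, insertion no longer finds a z, which is the bleeding case shown in the second derivation.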

As  was  suggested  in  Chapter  3,  each  of  these  rules  can  be  represented 
by  a transducer.  Since  the  rules  are  ordered,  the  transducers  would  also  need 
to  be  ordered.  For  example  if  they  are  placed  in  a cascade,  the  output  of  the 
first  transducer  would  feed  the  input  of  the  second  transducer. 

Many rules can be cascaded together this way. As Chapter 3 discussed,
running a cascade, particularly one with many levels, can be unwieldy, and
so transducer cascades are usually replaced with a single more complex
transducer by composing the individual transducers.

3 If we had chosen to represent the lexical pronunciation of -s as [s] rather than [z], we would
have written the rule inversely to voice the -s after voiced sounds, but the rules would still
need to be ordered; the ordering would simply flip.

Koskenniemi’s method of two-level morphology that was sketchily
introduced in Chapter 3 is another way to solve the problem of rule ordering.
Koskenniemi (1983) observed that most phonological rules in a grammar
are independent of one another; that feeding and bleeding relations between
rules are not the norm.⁴ Since this is the case, Koskenniemi proposed that
phonological rules be run in parallel rather than in series. The cases where
there is rule interaction (feeding or bleeding) we deal with by slightly
modifying some rules. Koskenniemi’s two-level rules can be thought of as a way
of expressing declarative constraints on the well-formedness of the
lexical-surface mapping.

Two-level rules also differ from traditional phonological rules by
explicitly coding when they are obligatory or optional, by using four differing
rule operators; the ⇔ rule corresponds to traditional obligatory
phonological rules, while the ⇒ rule implements optional rules:


Rule type          Interpretation
a:b ⇐ c ___ d      a is always realized as b in the context c ___ d
a:b ⇒ c ___ d      a may be realized as b only in the context c ___ d
a:b ⇔ c ___ d      a must be realized as b in context c ___ d and nowhere else
a:b /⇐ c ___ d     a is never realized as b in the context c ___ d
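
One way to read the four operators is as checks over an aligned sequence of lexical:surface pairs. The sketch below is our own simplification, not Koskenniemi's compilation into automata: each context is a single neighboring pair tested by a hypothetical predicate function, and the operators are spelled in ASCII as "<=", "=>", "<=>", and "/<=".

```python
def in_context(pairs, i, left, right):
    """True if pair i has a left neighbor matching `left` and a right one matching `right`."""
    ok_left = i > 0 and left(pairs[i - 1])
    ok_right = i + 1 < len(pairs) and right(pairs[i + 1])
    return ok_left and ok_right


def satisfies(pairs, op, a, b, left, right):
    """Check the two-level rule  a:b OP left ___ right  against aligned pairs."""
    for i, (lexical, surface) in enumerate(pairs):
        if lexical != a:
            continue
        context = in_context(pairs, i, left, right)
        realized = surface == b
        if op in ("<=", "<=>") and context and not realized:
            return False   # a must surface as b in this context
        if op in ("=>", "<=>") and realized and not context:
            return False   # a:b is licensed only in this context
        if op == "/<=" and context and realized:
            return False   # a:b is prohibited in this context
    return True


# Flapping as an example, with an assumed convention that lexical vowels
# end in 1 (stressed) or 0 (unstressed); contexts here look at the lexical side.
after_stressed = lambda pair: pair[0].endswith("1")
before_unstressed = lambda pair: pair[0].endswith("0")

flapped = [("b", "b"), ("uh1", "uh1"), ("t", "dx"), ("axr0", "axr0")]
unflapped = [("b", "b"), ("uh1", "uh1"), ("t", "t"), ("axr0", "axr0")]
misplaced = [("s", "s"), ("t", "dx"), ("aa1", "aa1")]
```

Under this reading the unflapped pronunciation violates the obligatory ⇐ rule but is acceptable to the optional ⇒ rule, while the misplaced flap violates ⇒ (and therefore ⇔) but not ⇐.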

The most important intuition of the two-level rules, and the mechanism
that lets them avoid feeding and bleeding, is their ability to represent
constraints on two levels. This is based on the use of the colon (‘:’), which
was touched on very briefly in Chapter 3. The symbol a:b means a lexical
a that maps to a surface b. Thus a:b ⇔ :c ___ means a is realized as b
after a surface c. By contrast a:b ⇔ c: ___ means that a is realized as b
after a lexical c. As discussed in Chapter 3, the symbol c with no colon is
equivalent to c:c, which means a lexical c that maps to a surface c.

Figure 4.11 shows an intuition for how the two-level approach avoids
ordering for the i-insertion and z-devoicing rules. The idea is that the
z-devoicing rule maps a lexical z to a surface s and the i-insertion rule refers
to the lexical z:

The two-level rules that model this constraint are shown in (4.6) and
(4.7):

ε:i ⇔ [+sibilant]: ^ ___ z: #    (4.6)

z:s ⇔ [-voice]: ^ ___ #    (4.7)

Figure 4.11 The constraints for the i-insertion and z-devoicing rules both
refer to a lexical z, not a surface s.

4 Feeding is a situation in which one rule creates the environment for another rule and so
must be run beforehand.

As Chapter 3 discussed, there are compilation algorithms for creating
automata from rules. Kaplan and Kay (1994) give the general derivation of
these algorithms, and Antworth (1990) gives one that is specific to two-level
rules. The automata corresponding to the two rules are shown in Figure 4.12
and Figure 4.13. Figure 4.12 is based on Figure 3.14 of Chapter 3; see page
78 for a reminder of how this automaton works. Note in Figure 4.12 that
the plural morpheme is represented by z:, indicating that the constraint is
expressed about a lexical rather than surface z.


Figure 4.14 shows the two automata run in parallel on the input [faks^z]
(the figure uses the ARPAbet notation [f aa k s ^ z]). Note that both
automata assume the default mapping ^:ε to remove the morpheme boundary,
and that both automata end in an accepting state.


Figure 4.13 The transducer for the z-devoicing rule 4.5. This rule might be
summarized as: devoice the morpheme z if it follows a morpheme-final voiceless
consonant.


Figure  4.14  The  transducer  for  the  i-insertion  rule  4.4  and  the  z-devoicing 
rule  4.5  run  in  parallel. 


4.4  Advanced  Issues  in  Computational  Phonology 

Harmony 

Rules like flapping, i-insertion, and z-devoicing are relatively simple as
phonological rules go. In this section we turn to the use of the two-level or
finite-state model of phonology to model more sophisticated phenomena; this
section will be easier to follow if the reader has some knowledge of phonology.

The Yawelmani dialect of Yokuts is a Native American language spoken in
California with a complex phonological system. In particular, there are three
phonological rules related to the realization of vowels that had to be ordered
in traditional phonology, and whose interaction thus demonstrates a
complicated use of finite-state phonology. These rules were first drawn up in the
traditional Chomsky and Halle (1968) format by Kisseberth (1969)
following the field work of Newman (1944).

First, Yokuts (like many other languages including for example
Turkish and Hungarian) has a phonological phenomenon called vowel harmony.
Vowel harmony is a process in which a vowel changes its form to look like
a neighboring vowel. In Yokuts, a suffix vowel changes its form to agree
in backness and roundness with the preceding stem vowel. That is, a front
vowel like /i/ will appear as a back vowel [u] if the stem vowel is /u/
(examples are taken from Cole and Kisseberth (1995)):⁵


Lexical     Surface    Gloss
dub+hin     dubhun     ‘tangles, non-future’
xil+hin     xilhin     ‘leads by the hand, non-future’
bok’+al     bok’ol     ‘might eat’
xat’+al     xat’al     ‘might find’


This Harmony rule has another constraint: it only applies if the suffix
vowel and the stem vowel are of the same height. Thus /u/ and /i/ are both
high, while /o/ and /a/ are both low.

The second relevant rule, Lowering, causes long high vowels to
become low; thus /u:/ becomes [o:] in the first example below:


Lexical        Surface     Gloss
?u:t’+it   →   ?o:t’ut     ‘steal, passive aorist’
mi:k’+it   →   me:k’it     ‘swallow, passive aorist’

The third rule, Shortening, shortens long vowels if they occur in closed
syllables:


Lexical          Surface
sa:p+hin     →   saphin
sudu:k+hin   →   sudokhun


The Yokuts rules must be ordered, just as the i-insertion and z-devoicing
rules had to be ordered. Harmony must be ordered before Lowering because
the /u:/ in the lexical form /?u:t’+it/ causes the /i/ to become [u] before it
lowers in the surface form [?o:t’ut]. Lowering must be ordered before
Shortening because the /u:/ in /sudu:k+hin/ lowers to [o]; if it were ordered after
Shortening it would appear on the surface as [u].
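
The three rules and their ordering can be checked against the examples above with a small cascade. This is a deliberately simplified sketch of our own, not Kisseberth's rules or a finite-state compilation: forms are ASCII strings with ' for glottalization, : for length, and + for the morpheme boundary; Harmony only handles the i/u and a/o alternations shown in the data; and a syllable counts as closed when the long vowel is followed by a consonant that is itself followed by a consonant or the end of the word.

```python
VOWELS = "aeiou"


def tokenize(form):
    """Split a form into segments, attaching : and ' to the preceding segment."""
    segments = []
    for ch in form:
        if ch in ":'" and segments:
            segments[-1] += ch
        else:
            segments.append(ch)
    return segments


def is_vowel(segment):
    return segment[0] in VOWELS


def harmony(segments):
    """Suffix i -> u after a stem u, and suffix a -> o after a stem o (same height)."""
    out, previous_vowel, in_suffix = [], None, False
    for seg in segments:
        if seg == "+":
            in_suffix = True
        elif is_vowel(seg):
            if in_suffix and previous_vowel == "u" and seg[0] == "i":
                seg = "u" + seg[1:]
            elif in_suffix and previous_vowel == "o" and seg[0] == "a":
                seg = "o" + seg[1:]
            previous_vowel = seg[0]
        out.append(seg)
    return out


def lowering(segments):
    """Long high vowels become low (mid in surface quality): u: -> o:, i: -> e:."""
    table = {"u:": "o:", "i:": "e:"}
    return [table.get(seg, seg) for seg in segments]


def shortening(segments):
    """Shorten a long vowel in a closed syllable; the + boundary is ignored."""
    out = list(segments)
    real = [i for i, seg in enumerate(segments) if seg != "+"]
    for k, i in enumerate(real):
        seg = segments[i]
        if is_vowel(seg) and seg.endswith(":"):
            following = [segments[j] for j in real[k + 1:k + 3]]
            closed = bool(following) and not any(is_vowel(s) for s in following)
            if closed:
                out[i] = seg[:-1]
    return out


def derive(form, rules):
    """Apply the rules in order, then erase the morpheme boundary."""
    segments = tokenize(form)
    for rule in rules:
        segments = rule(segments)
    return "".join(seg for seg in segments if seg != "+")


ORDERED = [harmony, lowering, shortening]
steal = derive("?u:t'+it", ORDERED)
```

Swapping the order reproduces the failures described above: running Lowering before Harmony leaves the suffix vowel of /?u:t’+it/ unrounded, and running Shortening before Lowering leaves a surface [u] in /sudu:k+hin/.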

Goldsmith (1993) and Lakoff (1993) independently observed that the
Yokuts data could be modeled by something like a transducer; Karttunen
(1998) extended the argument, showing that the Goldsmith and Lakoff
constraints could be represented either as a cascade of 3 rules in series, or in
the two-level formalism as 3 rules in parallel; Figure 4.15 shows the two
architectures. Just as in the two-level examples presented earlier, the rules
work by referring sometimes to the lexical context, sometimes to the surface
context; writing the rules is left as Exercise 4.10 for the reader.

5 For purposes of simplifying the explanation, this account ignores some parts of the system
such as vowel underspecification (Archangeli, 1984).


a) Cascade of rules.    b) Parallel two-level rules.

Figure 4.15 Combining the rounding, lowering, and shortening rules for
Yawelmani Yokuts.

Templatic  Morphology 

Finite-state models of phonology/morphology have also been proposed for
the templatic (non-concatenative) morphology (discussed on page 60)
common in Semitic languages like Arabic, Hebrew, and Syriac. McCarthy (1981)
proposed that this kind of morphology could be modeled by using different
levels of representation that Goldsmith (1976) had called tiers. Kay (1987)
proposed a computational model of these tiers via a special transducer which
reads four tapes instead of two, as in Figure 4.16:

The tricky part here is designing a machine which aligns the various
strings  on  the  tapes  in  the  correct  way;  Kay  proposed  that  the  binyan  tape 
could  act  as  a sort  of  guide  for  alignment.  Kay’s  intuition  has  led  to  a number 
of  more  fully-worked-out  finite-state  models  of  Semitic  morphology  such  as 
Beesley’s  (1996)  model  for  Arabic  and  Kiraz’s  (1997)  model  for  Syriac. 

The more recent work of Kornai (1991) and Bird and Ellison (1994)
showed how one-tape automata (i.e. finite-state automata rather than 4-tape
or even 2-tape transducers) could be used to model templatic morphology
and other kinds of phenomena that are handled with the tier-based
autosegmental representations of Goldsmith (1976).

Figure 4.16  A finite-state model of templatic (‘non-concatenative’)
morphology. From Kay (1987).


Optimality  Theory 

In a traditional phonological derivation, we are given an underlying
lexical form and a surface form. The phonological system then consists
of  one  component:  a sequence  of  rules  which  map  the  underlying  form  to 
the  surface  form.  Optimality  Theory  (OT)  (Prince  and  Smolensky,  1993) 
offers  an  alternative  way  of  viewing  phonological  derivation,  based  on  two 
functions  (GEN  and  EVAL)  and  a set  of  ranked  violable  constraints  (CON). 
Given  an  underlying  form,  the  GEN  function  produces  all  imaginable  surface 
forms,  even  those  which  couldn’t  possibly  be  a legal  surface  form  for  the 
input.  The  EVAL  function  then  applies  each  constraint  in  CON  to  these 
surface  forms  in  order  of  constraint  rank.  The  surface  form  which  best  meets 
the  constraints  is  chosen. 

A constraint in OT represents a wellformedness constraint on the
surface form, such as a phonotactic constraint on what segments can follow each
other, or a constraint on what syllable structures are allowed. A constraint
can also check how faithful the surface form is to the underlying form.

Let’s turn to our favorite complicated language, Yawelmani, for an
example.6 In addition to the interesting vowel harmony phenomena discussed
above, Yawelmani has a phonotactic constraint that rules out sequences of
consonants. In particular, three consonants in a row (CCC) are not allowed
to occur in a surface word. Sometimes, however, a word contains two
consecutive morphemes such that the first one ends in two consonants and the
second one starts with one consonant (or vice versa). What does the
language do to solve this problem? It turns out that Yawelmani either deletes
one of the consonants or inserts a vowel in between.

6 The following explication of OT via the Yawelmani example draws heavily from
Archangeli (1997) and a lecture by Jennifer Cole at the 1999 LSA Linguistic Institute.

For example, if a stem ends in a C, and its suffix starts with CC, the
first C of the suffix is deleted (‘+’ here means a morpheme boundary):

C-deletion   C → ε / C + __ C   (4.8)

Here is an example where the CCVC ‘passive consequent adjunctive’
morpheme hne:l (actually the underlying form is /hnil/) drops the initial C if
the previous morpheme ends in a consonant (and an example where it
doesn’t, for comparison):

underlying morphemes   gloss
diyel-ne:l-aw          ‘guard - passive consequent adjunctive - locative’
cawa-hne:l-aw          ‘shout - passive consequent adjunctive - locative’

If a stem ends in CC and the suffix starts with C, the language instead
inserts a vowel to break up the first two consonants:

V-insertion   ε → V / C __ C + C   (4.9)

Here are some examples in which an i is inserted into the root ?ilk- ‘sing’
and the root logw- ‘pulverize’ only when they are followed by a C-initial
suffix like -hin, ‘past’, not a V-initial suffix like -en, ‘future’:

surface form   gloss
?ilik-hin      ‘sang’
?ilken         ‘will sing’
logiwhin       ‘pulverized’
logwen         ‘will pulverize’

Kisseberth (1970) suggested that it was not a coincidence that
Yawelmani had these particular two rules (and for that matter other related deletion
rules that we haven’t presented). He noticed that these rules were
functionally related; in particular, they are all ways of avoiding 3 consonants in a row.

Another way of stating this generalization is to talk about syllable structure.
Yawelmani syllables are only allowed to be of the form CVC or CV (where
C means a consonant and V means a vowel). We say that languages like
Yawelmani don’t allow complex onsets or complex codas. From the point
of view of syllabification, then, these insertions and deletions all happen so
as to allow Yawelmani words to be properly syllabified. Since CVCC
syllables aren’t allowed on the surface, CVCC roots must be resyllabified when
they appear on the surface. For example, here are the syllabifications of the
Yawelmani words we have discussed and some others; note, for example,
that the surface syllabification of the CVCC syllables moves the final
consonant to the beginning of the next syllable:


underlying morphemes   surface syllabification   gloss
?ilk-en                ?il.ken                   ‘will sing’
logw-en                log.wen                   ‘will pulverize’
logw-hin               lo.giw.hin                ‘pulverized’
xat-en                 xa.ten                    ‘will eat’
diyel-hnil-aw          di.yel.ne:.law            ‘ask - pass. cons. adjunct. - locative’


Here’s where Optimality Theory comes in. The basic idea in Optimality
Theory is that the language has various constraints on things like syllable
structure. These constraints generally apply to the surface form. One such
constraint, *COMPLEX, says ‘No complex onsets or codas’. Another class
of  constraints  requires  the  surface  form  to  be  identical  to  (faithful  to)  the 
underlying  form.  Thus  FaithV  says  ‘Don’t  delete  or  insert  vowels’  and 
FaithC  says  ‘Don’t  delete  or  insert  consonants’.  Given  an  underlying  form, 
the  GEN  function  produces  all  possible  surface  forms  (i.e.  every  possible 
insertion  and  deletion  of  segments  with  every  possible  syllabification)  and 
they  are  ranked  by  the  EVAL  function  using  these  constraints.  Figure  4.17 
shows  the  architecture. 


The EVAL function works by applying each constraint in ranked order;
the optimal candidate is one which either violates no constraints, or violates
fewer of them than all the other candidates. This evaluation is usually shown
on a tableau (plural tableaux). The top left-hand cell shows the input, the
constraints are listed in order of rank across the top row, and the possible
outputs along the left-most column. Although there are an infinite number
of candidates, it is traditional to show only the ones which are ‘close’; in
the tableau below we have shown the output ?ak.pid just to make it clear
that even very different surface forms are to be included. If a form violates
a constraint, the relevant cell contains *; a *! indicates the fatal violation
which causes a candidate to be eliminated. Cells for constraints which are
irrelevant (since a higher-level constraint is already violated) are shaded.


/?ilk-hin/       *COMPLEX   FaithC   FaithV
  ?ilk.hin       *!
  ?il.khin       *!
  ?il.hin                   *!
☞ ?i.lik.hin                         *
  ?ak.pid                   *!
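The evaluation shown in the tableau can be sketched in a few lines of code. This is only an illustrative sketch under our own assumptions: candidates are written with "." between syllables, each constraint is a hypothetical function returning a violation count, faithfulness is approximated by counting inserted and deleted segments as multisets, and strict ranking is implemented by comparing violation tuples lexicographically.

```python
import re
from collections import Counter

VOWELS = set("aeiou")

def segments(form):
    """Drop syllable dots and morpheme hyphens, keeping only segments."""
    return [ch for ch in form if ch not in ".-"]

def star_complex(_underlying, candidate):
    """One violation per syllable with two or more consonants in onset or coda."""
    count = 0
    for syllable in candidate.split("."):
        m = re.fullmatch(r"([^aeiou]*)[aeiou]+([^aeiou]*)", syllable)
        if m is None or len(m.group(1)) > 1 or len(m.group(2)) > 1:
            count += 1
    return count

def _faith(underlying, candidate, keep):
    under = Counter(s for s in segments(underlying) if keep(s))
    cand = Counter(s for s in segments(candidate) if keep(s))
    # Number of segments deleted plus number of segments inserted.
    return sum((under - cand).values()) + sum((cand - under).values())

def faith_c(underlying, candidate):
    return _faith(underlying, candidate, lambda s: s not in VOWELS)

def faith_v(underlying, candidate):
    return _faith(underlying, candidate, lambda s: s in VOWELS)

def eval_ot(underlying, candidates, ranking):
    """Return the candidate whose violation profile is smallest in ranked order."""
    def profile(candidate):
        return tuple(constraint(underlying, candidate) for constraint in ranking)
    return min(candidates, key=profile)

CANDIDATES = ["?ilk.hin", "?il.khin", "?il.hin", "?i.lik.hin", "?ak.pid"]
RANKING = [star_complex, faith_c, faith_v]
print(eval_ot("?ilk-hin", CANDIDATES, RANKING))
```

Python compares tuples position by position, so a single violation of a higher-ranked constraint outweighs any number of violations of lower-ranked ones, which is the behavior the tableau encodes.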

One appeal of Optimality Theoretic derivations is that the constraints
are presumed to be cross-linguistic generalizations. That is, all languages are
presumed to have some version of faithfulness, some preference for simple
syllables, and so on. Languages differ in how they rank the constraints; thus
English, presumably, ranks FaithC higher than *COMPLEX. (How do we
know this?)

Can a derivation in Optimality Theory be implemented by finite-state
transducers? Frank and Satta (1999), following the foundational work of
Ellison (1994), showed that (1) if GEN is a regular relation (for example
assuming the input doesn’t contain context-free trees of some sort), and (2)
if the number of allowed violations of any constraint has some finite bound,
then an OT derivation can be computed by finite-state means. This second
constraint is relevant because of a property of OT that we haven’t mentioned:
if two candidates violate exactly the same number of constraints, the winning
candidate is the one which has the smallest number of violations of the
relevant constraint.

One way to implement OT as a finite-state system was worked out by
Karttunen (1998), following the above-mentioned work and that of
Hammond (1997). In Karttunen’s model, GEN is implemented as a finite-state
transducer which is given an underlying form and produces a set of
candidate forms. For example, for the syllabification example above, GEN would
generate all strings that are variants of the input with consonant deletions or
vowel insertions, and their syllabifications.

Each constraint is implemented as a filter transducer which lets pass
only strings which meet the constraint. For legal strings, the transducer thus
acts as the identity mapping. For example, *COMPLEX would be
implemented via a transducer that mapped any input string to itself, unless the
input string had two consonants in the onset or coda, in which case it would
be mapped to null.

The constraints can then be placed in a cascade, in which higher-ranked
constraints are simply run first, as suggested in Figure 4.18.


[Figure: a cascade of transducers, GEN → *COMPLEX → FAITHC → FAITHV.]

Figure  4.18  Version  #1  (‘merciless  cascade’)  of  Karttunen’s  finite-state 
cascade  implementation  of  OT. 

There is one crucial flaw with the cascade model in Figure 4.18. Recall
that the constraint transducers filter out any candidate which violates a
constraint. But in many derivations, including the proper derivation of ?i.lik.hin,
even the optimal form still violates a constraint. The cascade in Figure 4.18
would incorrectly filter it out, leaving no surface form at all! Frank and Satta
(1999) and Hammond (1997) both point out that it is essential to only
enforce a constraint if it does not reduce the candidate set to zero. Karttunen
(1998) formalizes this intuition with the lenient composition operator.
Lenient composition is a combination of regular composition and an operation
called priority union. The basic idea is that if any candidates meet the
constraint, these candidates will be passed through the filter as usual. If no output
meets the constraint, lenient composition retains all of the candidates.
Figure 4.19 shows the general idea; the interested reader should see Karttunen
(1998) for the details. Also see Tesar (1995, 1996), Fosler (1996), and Eisner
(1997) for discussions of other computational issues in OT.

Figure 4.19  Version #2 (‘lenient cascade’) of Karttunen’s finite-state
cascade implementation of OT, showing a visualization of the candidate
populations that would be passed through each FST constraint.
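The difference between the two cascades can be illustrated without building any transducers. The sketch below is our own simplification, not Karttunen's implementation: candidate sets are plain Python sets, each constraint is a hypothetical pass/fail predicate (so violation counts are ignored), and lenient composition is approximated by keeping the old candidate set whenever a filter would empty it.

```python
import re

UNDERLYING = "?ilkhin"  # /?ilk-hin/ with the morpheme boundary removed

def no_complex(candidate):
    """True if every syllable has at most one onset and one coda consonant."""
    return all(re.fullmatch(r"[^aeiou]?[aeiou]+[^aeiou]?", syllable)
               for syllable in candidate.split("."))

def consonants(form):
    return sorted(ch for ch in form if ch not in "aeiou.")

def vowels(form):
    return sorted(ch for ch in form if ch in "aeiou")

def faith_c(candidate):
    """True if no consonant was inserted or deleted."""
    return consonants(candidate) == consonants(UNDERLYING)

def faith_v(candidate):
    """True if no vowel was inserted or deleted."""
    return vowels(candidate) == vowels(UNDERLYING)

def merciless_cascade(candidates, constraints):
    """Version #1: every constraint removes all violators."""
    for satisfies in constraints:
        candidates = {c for c in candidates if satisfies(c)}
    return candidates

def lenient_cascade(candidates, constraints):
    """Version #2: a constraint is enforced only if some candidate survives it."""
    for satisfies in constraints:
        survivors = {c for c in candidates if satisfies(c)}
        if survivors:
            candidates = survivors
    return candidates

GEN_OUTPUT = {"?ilk.hin", "?il.khin", "?il.hin", "?i.lik.hin", "?ak.pid"}
RANKED = [no_complex, faith_c, faith_v]
print(merciless_cascade(GEN_OUTPUT, RANKED))  # nothing survives
print(lenient_cascade(GEN_OUTPUT, RANKED))    # the optimal candidate survives
```

In this toy run the merciless cascade removes every candidate at the FaithV step, while the lenient version ignores FaithV there because enforcing it would leave no output.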

4.5  Machine  Learning  of  Phonological  Rules 

The  task  of  a machine  learning  system  is  to  automatically  induce  a model 
for  some  domain,  given  some  data  from  the  domain  and,  sometimes,  other 
information  as  well.  Thus  a system  to  learn  phonological  rules  would  be 
given  at  least  a set  of  (surface  forms  of)  words  to  induce  from.  A supervised 
algorithm  is  one  which  is  given  the  correct  answers  for  some  of  this  data, 
using  these  answers  to  induce  a model  which  can  generalize  to  new  data  it 
hasn’t  seen  before.  An  unsupervised  algorithm  does  this  purely  from  the 
data.  While  unsupervised  algorithms  don’t  get  to  see  the  correct  labels  for 
the  classifications,  they  can  be  given  hints  about  the  nature  of  the  rules  or 
models  they  should  be  forming.  For  example,  the  knowledge  that  the  models 
will be in the form of automata is itself a kind of hint. Such hints are called
a learning bias.

This section gives a very brief overview of some models of
unsupervised machine learning of phonological rules; more details about machine
learning algorithms will be presented throughout the book.

Ellison (1992) showed that concepts like the consonant and vowel
distinction, the syllable structure of a language, and harmony relationships
could be learned by a system based on choosing the model from the set
of potential models which is the simplest. Simplicity can be measured by
choosing the model with the minimum coding length, or the highest
probability (we will define these terms in detail in Chapter 6). Daelemans et al.
(1994) used the Instance-Based Generalization algorithm (Aha et al., 1991)
to learn stress rules for Dutch; the algorithm is a supervised one which is
given a number of words together with their stress patterns, and which
induces generalizations about the mapping from the sequences of light and
heavy syllable types in the word (light syllables have no coda consonant;
heavy syllables have one) to the stress pattern. Tesar and Smolensky (1993)
show that a system which is given Optimality Theory constraints but not
their ranking can learn the ranking from data via a simple greedy algorithm.
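To make the idea of learning a ranking concrete, here is a minimal sketch of a greedy procedure in the style of constraint demotion. It is written under our own assumptions and is not claimed to reproduce the exact algorithm of Tesar and Smolensky (1993): the learner receives pairs of violation profiles (an attested winner and a competing loser) and repeatedly places at the top of the hierarchy every constraint that never prefers a loser in the pairs still unexplained.

```python
def rank_constraints(constraints, pairs):
    """Return a list of strata (highest-ranked first).

    pairs: list of (winner_violations, loser_violations), each a dict
    mapping a constraint name to a violation count.
    """
    remaining = list(pairs)
    unranked = list(constraints)
    strata = []
    while unranked:
        # A constraint may be ranked now if it prefers no loser.
        stratum = [c for c in unranked
                   if all(winner[c] <= loser[c] for winner, loser in remaining)]
        if not stratum:
            raise ValueError("no ranking is consistent with the data")
        strata.append(stratum)
        unranked = [c for c in unranked if c not in stratum]
        # A pair is explained once a ranked constraint prefers its winner.
        remaining = [(winner, loser) for winner, loser in remaining
                     if not any(winner[c] < loser[c] for c in stratum)]
    return strata

# Violation profiles taken from the /?ilk-hin/ tableau discussed earlier.
winner = {"*COMPLEX": 0, "FaithC": 0, "FaithV": 1}   # ?i.lik.hin
loser1 = {"*COMPLEX": 1, "FaithC": 0, "FaithV": 0}   # ?ilk.hin
loser2 = {"*COMPLEX": 0, "FaithC": 1, "FaithV": 0}   # ?il.hin
print(rank_constraints(["*COMPLEX", "FaithC", "FaithV"],
                       [(winner, loser1), (winner, loser2)]))
```

On this toy data the learner places *COMPLEX and FaithC together above FaithV; the data say nothing about the relative rank of the first two, so they share a stratum.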

Johnson (1984) gives one of the first computational algorithms for
phonological rule induction. His algorithm works for rules of the form

(4.10)  a → b / C

where C is the feature matrix of the segments around a. Johnson’s algorithm
sets up a system of constraint equations which C must satisfy, by
considering both the positive contexts, i.e., all the contexts C_i in which a b occurs on
the surface, as well as all the negative contexts C_j in which an a occurs on
the surface. Touretzky et al. (1990) extended Johnson’s insight by using the
version spaces algorithm of Mitchell (1981) to induce phonological rules in
their Many Maps architecture, which is similar to two-level phonology. Like
Johnson’s, their system looks at the underlying and surface realizations of
single segments. For each segment, the system uses the version space
algorithm to search for the proper statement of the context. The model also has a
separate algorithm which handles harmonic effects by looking for multiple
segmental changes in the same word, and is more general than Johnson’s in
dealing with epenthesis and deletion rules.

The algorithm of Gildea and Jurafsky (1996) was designed to induce
transducers representing two-level rules of the type we have discussed
earlier. Like the algorithm of Touretzky et al. (1990), Gildea and Jurafsky’s
algorithm was given sets of pairings of underlying and surface forms. The
algorithm was based on the OSTIA (Oncina et al., 1993) algorithm, which is
a general learning algorithm for a subtype of finite-state transducers called
subsequential transducers. By itself, the OSTIA algorithm was too general
to learn phonological transducers, even given a large corpus of
underlying-form/surface-form pairs. Gildea and Jurafsky then augmented the
domain-independent OSTIA system with three kinds of learning biases which are
specific to natural language phonology; the main two are Faithfulness
(underlying segments tend to be realized similarly on the surface), and
Community (similar segments behave similarly). The resulting system was able
to learn transducers for flapping in American English, or German consonant
devoicing.

Finally, many learning algorithms for phonology are probabilistic. For
example, Riley (1991) and Withgott and Chen (1993) proposed a
decision-tree approach to segmental mapping. A decision tree is induced for each
segment, classifying possible realizations of the segment in terms of
contextual factors such as stress and the surrounding segments. Decision trees and
probabilistic algorithms in general will be defined in Chapter 5 and
Chapter 6.


4.6  Mapping  Text  to  Phones  for  TTS 

Dearest creature in Creation
Studying English pronunciation
I will teach you in my verse
Sounds like corpse, corps, horse and worse.
It will keep you, Susy, busy,
Make your head with heat grow dizzy
River, rival; tomb, bomb, comb;
Doll and roll, and some and home.
Stranger does not rime with anger
Neither does devour with clangour.

G.N. Trenite (1870-1946) The Chaos, reprinted
in Witten (1982).

Now that we have learned the basic inventory of phones in English and
seen how to model phonological rules, we are ready to study the problem of
mapping from an orthographic or text word to its pronunciation.


Pronunciation  dictionaries 

An important component of this mapping is a pronunciation dictionary.
These dictionaries are actually used in both ASR and TTS systems, although
because of the different needs of these two areas the contents of the
dictionaries are somewhat different.

The  simplest  pronunciation  dictionaries  just  have  a list  of  words  and 
their  pronunciations: 

Word    Pronunciation    Word        Pronunciation
cat     [kæt]            goose       [gus]
cats    [kæts]           geese       [gis]
pig     [pɪg]            hedgehog    [hɛdʒ.hɔg]
pigs    [pɪgz]           hedgehogs   [hɛdʒ.hɔgz]
fox     [fɑks]           foxes       [fɑk.sɨz]

Three large, commonly-used, on-line pronunciation dictionaries in this
format are PRONLEX, CMUdict, and CELEX. These are used for speech
recognition and can also be adapted for use in speech synthesis. The
PRONLEX dictionary (LDC, 1995) was designed for speech recognition
applications and contains pronunciations for 90,694 wordforms. It covers all the
words used in many years of the Wall Street Journal, as well as the
Switchboard Corpus. The CMU Pronouncing Dictionary was also developed
for ASR purposes and has pronunciations for about 100,000 wordforms.
The CELEX dictionary (Celex, 1993) includes all the words in the Oxford
Advanced Learner’s Dictionary (1974) (41,000 lemmata) and the Longman
Dictionary of Contemporary English (1978) (53,000 lemmata); in total it has
pronunciations for 160,595 wordforms. Its pronunciations are British while
the other two are American. Each dictionary uses a different phone set; the
CMU and PRONLEX phonesets are derived from the ARPAbet, while the
CELEX dictionary is derived from the IPA. All three represent three levels
of stress: primary stress, secondary stress, and no stress. Figure 4.20 shows
the pronunciation of the word armadillo in all three dictionaries.


Dictionary   Pronunciation               IPA Version
Pronlex      +arm.xd'Il.o                [ˌɑrmə'dɪlou]
CMU          AA2 R M AH0 D IH1 L OW0     [ˌɑrmʌ'dɪlou]
CELEX        "#-m@-'dI-l5                [ˌɑː.mə.'dɪ.ləu]

Figure 4.20  The pronunciation of the word armadillo in three dictionaries.
Rather than explain special symbols we have given an IPA equivalent for each
pronunciation. The CMU dictionary represents unstressed vowels ([ə], [ɨ], etc.)
by giving a 0 stress level to the vowel (we represented this by underlining in
the IPA form). Note the British r-dropping and use of the [əu] rather than [ou]
vowel in the CELEX pronunciation.
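As a small illustration of the CMU notation, the sketch below splits an entry into phones and stress levels. The function name and the tuple representation are our own choices; the only convention assumed is that the digit attached to a vowel encodes stress (1 for primary, 2 for secondary, 0 for none).

```python
STRESS_NAMES = {0: "no stress", 1: "primary stress", 2: "secondary stress"}

def parse_arpabet(entry):
    """Split a CMU-style entry into (phone, stress) pairs.

    Consonants carry no stress digit, so their stress is None.
    """
    phones = []
    for token in entry.split():
        if token[-1].isdigit():
            phones.append((token[:-1], int(token[-1])))
        else:
            phones.append((token, None))
    return phones

armadillo = parse_arpabet("AA2 R M AH0 D IH1 L OW0")
for phone, stress in armadillo:
    label = STRESS_NAMES.get(stress, "consonant")
    print(f"{phone:3} {label}")
```

Reading the entry this way shows the secondary stress on the first vowel and the primary stress on the third.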

Often two distinct words are spelled the same (they are homographs)
but pronounced differently. For example the verb wind (‘You need to wind
this up more neatly’) is pronounced [waɪnd] while the noun wind (‘blow,
blow, thou winter wind’) is pronounced [wɪnd]. This distinction is essential for TTS
applications (since in a given context the system needs to say one or the
other) but for some reason is usually ignored in current speech recognition
systems. Printed pronunciation dictionaries give distinct pronunciations for
each part of speech; CELEX does as well. Since they were designed for
ASR, Pronlex and CMU, although they give two pronunciations for the form
wind, don’t specify which one is used for which part of speech.
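The two styles of entry can be contrasted in a short sketch. The data structures and the part-of-speech labels below are hypothetical and chosen only for illustration; they are not the actual formats of CELEX, Pronlex, or CMU.

```python
# TTS-style lexicon: pronunciations are keyed by word and part of speech.
TTS_LEXICON = {
    ("wind", "verb"): "waɪnd",
    ("wind", "noun"): "wɪnd",
}

# ASR-style lexicon: each word just lists its possible pronunciations.
ASR_LEXICON = {
    "wind": {"waɪnd", "wɪnd"},
}

def pronounce_for_tts(word, part_of_speech):
    """Return the single pronunciation to speak, or None if unknown."""
    return TTS_LEXICON.get((word.lower(), part_of_speech))

def pronunciations_for_asr(word):
    """Return every pronunciation the recognizer should accept."""
    return ASR_LEXICON.get(word.lower(), set())

print(pronounce_for_tts("wind", "verb"))
print(pronounce_for_tts("wind", "noun"))
print(sorted(pronunciations_for_asr("wind")))
```

A synthesizer using the first table still needs some way of deciding the part of speech in context before it can look the word up.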

Dictionaries often don’t include many proper names. This is a
serious problem for many applications; Liberman and Church (1992) report that
21% of the word tokens in their 33 million word 1988 AP newswire
corpus were names. Furthermore, they report that a list obtained in 1987 from
the Donnelly marketing organization contains 1.5 million names (covering
72 million households in the United States). But only about 1000 of the
52,477 lemmas in CELEX (which is based on traditional dictionaries) are
proper names. By contrast Pronlex includes 20,000 names; this is still only
a small fraction of the 1.5 million. Very few dictionaries give pronunciations
for entries like Dr., which as Liberman and Church (1992) point out can be
“doctor” or “drive”, or 2/3, which can be “two thirds” or “February third” or
“two slash three”.

No dictionaries currently have good models for the pronunciation of
function words (and, I, a, the, of, etc). This is because the variation in these
words due to phonetic context is so great. Usually the dictionaries include
some simple baseform (such as [ði] for the) and use other algorithms to
derive the variation due to context; Chapter 5 will treat the issue of modeling
contextual pronunciation variation for words of this sort.

One  significant  difference  between  TTS  and  ASR  dictionaries  is  that 
TTS  dictionaries  do  not  have  to  represent  dialectal  variation;  thus  where 
a very  accurate  ASR  dictionary  needs  to  represent  both  pronunciations  of 
either  and  tomato,  a TTS  dictionary  can  choose  one. 

Beyond  Dictionary  Lookup:  Text  Analysis 

Mapping  from  text  to  phones  relies  on  the  kind  of  pronunciation  dictionaries 
we  talked  about  in  the  last  section.  As  we  suggested  before,  one  way  to  map 
text-to-phones  would  be  to  look  up  each  word  in  a pronunciation  dictionary 
and  read  the  string  of  phones  out  of  the  dictionary.  This  method  would  work 
fine  for  any  word  that  we  can  put  in  the  dictionary  in  advance.  But  as  we 
saw  in  Chapter  3,  it’s  not  possible  to  represent  every  word  in  English  (or  any 
other  language)  in  advance.  Both  speech  synthesis  and  speech  recognition 
systems need to be able to guess at the pronunciation of words that are not
in their dictionary. This section will first examine the kinds of words that
are likely to be missing in a pronunciation dictionary, and then show how
the  finite-state  transducers  of  Chapter  3 can  be  used  to  model  the  basic  task 
of  text-to-phones.  Chapter  5 will  introduce  variation  in  pronunciation  and 
introduce  probabilistic  techniques  for  modeling  it. 

Three  of  the  most  important  cases  where  we  cannot  rely  on  a word 
dictionary  involve  names,  morphological  productivity,  and  numbers.  As 
a brief  example,  we  arbitrarily  selected  a brief  (561  word)  movie  review  that 
appeared  in  today’s  issue  of  the  New  York  Times.  The  review,  of  Vincent 
Gallo’s “Buffalo ’66”, was written by Janet Maslin. Here’s the beginning of
the  article: 

In  Vincent  Gallo’s  “Buffalo  ’66,”  Billy  Brown  (Gallo)  steals  a 
blond  kewpie  doll  named  Layla  (Christina  Ricci)  out  of  her  tap 
dancing  class  and  browbeats  her  into  masquerading  as  his  wife  at 
a dinner  with  his  parents.  Billy  hectors,  cajoles  and  tries  to  bribe 
Layla.  (“You  can  eat  all  the  food  you  want.  Just  make  me  look 
good.”)  He  threatens  both  that  he  will  kill  her  and  that  he  won’t 
be  her  best  friend.  He  bullies  her  outrageously  but  with  such 
crazy  brio  and  jittery  persistence  that  Layla  falls  for  him.  Gallo’s 
film,  a deadpan  original  mixing  pathos  with  bravado,  works  on 
its  audience  in  much  the  same  way. 

We then took two large commonly-used on-line pronunciation dictionaries:
the PRONLEX dictionary, which contains pronunciations for 90,694
wordforms and includes coverage of many years of the Wall Street Journal, as
well as the Switchboard Corpus, and the larger CELEX dictionary, which
has pronunciations for 160,595 wordforms. The combined dictionaries have
approximately 194,000 pronunciations. Of the 561 words in the movie
review, 16 (3%) did not have pronunciations in these two dictionaries (not
counting two hyphenated words, baby-blue and hollow-eyed). Here they
are:


Names                   Inflected Names   Numbers   Other
Aki        Gazzara      Gallo’s           ’66       c’mere
Anjelica   Kaurismaki                               indie
Arquette   Kusturica                                kewpie
Buscemi    Layla                                    sexpot
Gallo      Rosanna
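The kind of coverage check described here is easy to sketch. The code below is a toy version under our own assumptions: the tokenizer is a simple regular expression, and the lexicon is a tiny hypothetical stand-in for PRONLEX and CELEX, which we do not reproduce.

```python
import re

def tokenize(text):
    """Split text into word-like tokens (letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)

def out_of_vocabulary(tokens, lexicon):
    """Return the distinct tokens whose lowercase form is not in the lexicon."""
    seen, missing = set(), []
    for token in tokens:
        key = token.lower()
        if key not in lexicon and key not in seen:
            seen.add(key)
            missing.append(token)
    return missing

# Hypothetical miniature lexicon, for illustration only.
TOY_LEXICON = {"billy", "hectors", "cajoles", "and", "tries", "to", "bribe"}

sentence = "Billy hectors, cajoles and tries to bribe Layla."
tokens = tokenize(sentence)
missing = out_of_vocabulary(tokens, TOY_LEXICON)
print(missing, f"{len(missing)} of {len(tokens)} tokens")

# The rate reported in the text: 16 missing words out of 561.
print(f"{100 * 16 / 561:.1f}%")
```

The last line confirms the arithmetic behind the rounded figure: 16 out of 561 is a little under 3%.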

Some of these missing words can be found by increasing the dictionary
size (for example Wells’s (1990) definitive (but not on-line) pronunciation
dictionary of English does have sexpot and kewpie). But the rest need to
be generated on-line.

Names are a large problem for pronunciation dictionaries. It is
difficult or impossible to list in advance all proper names in English; furthermore
they may come from any language, and may have variable spellings. Most
potential applications for TTS or ASR involve names; for example names
are essential in telephony applications (directory assistance, call routing).
Corporate names are important in many applications and are created
constantly (CoComp, Intel, Cisco). Medical speech applications (such as
transcriptions of doctor-patient interviews) require pronunciations of names of
pharmaceuticals; there are some off-line medical pronunciation dictionaries
but they are known to be extremely inaccurate (Markey and Ward, 1997).
Recall the figure of 1.5 million names mentioned above, and Liberman and
Church’s (1992) finding that 21% of the word tokens in their 33 million word
1988 AP newswire corpus were names.

Morphology is a particular problem for many languages other than
English. For languages with very productive morphology it is computationally
infeasible to represent every possible word; recall this Turkish example:

(4.11)  uygarlaştıramadıklarımızdanmışsınızcasına

uygar      +laş  +tır   +ama      +dık    +lar  +ımız  +dan  +mış   +sınız  +casına
civilized  +BEC  +CAUS  +NEGABLE  +PPART  +PL   +P1PL  +ABL  +PAST  +2PL    +AsIf

‘(behaving) as if you are among those whom we could not
civilize/cause to become civilized’

Even a language as similar to English as German has greater ability to
create words; Sproat et al. (1998) note the spontaneously created German
example Unerfindlichkeitsunterstellung (‘allegation of incomprehensibility’).

But even in English, morphologically simple though it is,
morphological knowledge is necessary for pronunciation modeling. For example names
and acronyms are often inflected (Gallo’s, IBM’s, DATs, Syntex’s) as are
new words (faxes, indies). Furthermore, we can’t just ‘add s’ on to the
pronunciation of the uninflected forms, because as the last section showed, the
possessive -’s and plural -s suffix in English are pronounced differently in
different contexts; Syntex’s is pronounced [sɪntɛksɨz], faxes is pronounced
[fæksɨz], IBM’s is pronounced [aɪbijɛmz], and DATs is pronounced [dæts].
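The choice among the three pronunciations of the suffix can be sketched as a rule on the final phone of the stem. The code below is a simplified illustration under our own assumptions: stems are given as lists of ARPAbet-style phones without stress digits, "IX" stands for the reduced vowel, and the phone classes are limited to the cases needed here.

```python
SIBILANTS = {"S", "Z", "SH", "ZH", "CH", "JH"}
VOICELESS = {"P", "T", "K", "F", "TH"}

def add_s_suffix(stem_phones):
    """Append the plural or possessive suffix to a list of stem phones."""
    last = stem_phones[-1]
    if last in SIBILANTS:
        suffix = ["IX", "Z"]   # vowel inserted after a sibilant
    elif last in VOICELESS:
        suffix = ["S"]         # voiceless after a voiceless consonant
    else:
        suffix = ["Z"]         # voiced elsewhere (voiced consonants, vowels)
    return stem_phones + suffix

examples = {
    "Syntex's": ["S", "IH", "N", "T", "EH", "K", "S"],
    "faxes":    ["F", "AE", "K", "S"],
    "IBM's":    ["AY", "B", "IY", "EH", "M"],
    "DATs":     ["D", "AE", "T"],
}
for word, stem in examples.items():
    print(word, " ".join(add_s_suffix(stem)))
```

The order of the tests matters: the sibilant case has to be checked first, since [s] and [ʃ] are themselves voiceless.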




Finally, pronouncing numbers is a particularly difficult problem. The
’66 in Buffalo ’66 is pronounced [sɪkstisɪks] not [sɪkssɪks]. The most natural
way to pronounce the phone number ‘947-2020’ is probably
‘nine’-‘four’-‘seven’-‘twenty’-‘twenty’ rather than
‘nine’-‘four’-‘seven’-‘two’-‘zero’-‘two’-‘zero’. Liberman and Church (1992) note that there are five main ways to
pronounce a string of digits (although others are possible):


• Serial: each digit is pronounced separately: 8765 is “eight seven six
five”.

• Combined: the digit string is pronounced as a single integer, with all
position labels read out: “eight thousand seven hundred sixty five”.

• Paired: each pair of digits is pronounced as an integer; if there is an
odd number of digits the first one is pronounced by itself:
“eighty-seven sixty-five”.

• Hundreds: strings of four digits can be pronounced as counts of
hundreds: “eighty-seven hundred (and) sixty-five”.

• Trailing Unit: strings that end in zeros are pronounced serially until
the last nonzero digit, which is pronounced followed by the appropriate
unit: 8765000 is “eight seven six five thousand”.

Pronunciation  of  numbers  and  these  five  methods  arc  discussed  further 
in  Exercises  4.5  and  4.6. 
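
The five readings listed above can be made concrete with a short sketch. The Python below is only an illustration, not the method of Liberman and Church (1992): the function names, the word tables, and the restriction of the combined reading to integers below one million are all assumptions made here. Choosing which reading fits a given context is the hard part and is not attempted.

```python
# Sketch of the five digit-string readings; all names here are illustrative.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
# Units for the trailing-unit reading, keyed by the number of trailing zeros.
UNITS = {2: "hundred", 3: "thousand", 6: "million"}


def two_digits(n):
    """Read an integer in 0..99 as a single number."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def below_thousand(n):
    """Read an integer in 1..999 with its position labels."""
    hundreds_digit, rest = divmod(n, 100)
    parts = []
    if hundreds_digit:
        parts.append(ONES[hundreds_digit] + " hundred")
    if rest:
        parts.append(two_digits(rest))
    return " ".join(parts)


def serial(digits):
    """Each digit on its own."""
    return " ".join(ONES[int(d)] for d in digits)


def combined(digits):
    """A single integer below one million, all position labels read out."""
    n = int(digits)
    if n == 0:
        return "zero"
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(below_thousand(thousands) + " thousand")
    if rest:
        parts.append(below_thousand(rest))
    return " ".join(parts)


def paired(digits):
    """Pairs of digits read as integers; an odd first digit stands alone."""
    parts = []
    if len(digits) % 2:
        parts.append(ONES[int(digits[0])])
        digits = digits[1:]
    for i in range(0, len(digits), 2):
        parts.append(two_digits(int(digits[i:i + 2])))
    return " ".join(parts)


def hundreds(digits):
    """A four-digit string read as a count of hundreds."""
    head, tail = int(digits[:2]), int(digits[2:])
    reading = two_digits(head) + " hundred"
    return reading + (" " + two_digits(tail) if tail else "")


def trailing_unit(digits):
    """Serial reading up to the last nonzero digit, then the unit."""
    stripped = digits.rstrip("0")
    zeros = len(digits) - len(stripped)
    if not stripped or zeros not in UNITS:
        return serial(digits)  # no suitable unit: fall back to serial
    return serial(stripped) + " " + UNITS[zeros]
```

Under these assumptions the phone number 947-2020 could be read by calling serial("947") for the prefix and paired("2020") for the last four digits. A pair such as "05" comes out as plain "five" in this sketch, which a real system would need to handle separately.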


An  FST-based  pronunciation  lexicon 


Early work in pronunciation modeling for text-to-speech systems (such as
the seminal MITalk system of Allen et al. (1987)) relied heavily on
letter-to-sound rules. Each rule specified how a letter or combination of
letters was mapped to phones; here is a fragment of such a rule-base from
Witten (1982):


Fragment      Pronunciation
-p-           [p]
-ph-          [f]
-phe-         [fi]
-phes-        [fiz]
-place-       [pleɪs]
-placi-       [pleɪsi]
-plement-     [plɪmənt]

Such systems consisted of a long list of such rules and a very small
dictionary of exceptions (often function words such as a, are, as, both, do,
does, etc.). More recent systems have completely inverted the algorithm,
relying on very large dictionaries, with letter-to-sound rules only used for
the small number of words that are neither in the dictionary nor
morphological variants of words in the dictionary. How can these large
dictionaries be represented in a way that allows for morphological
productivity? Luckily, these morphological issues in pronunciation (adding
inflectional suffixes, slight pronunciation changes at the juncture of two
morphemes, etc.) are identical to the morphological issues in spelling that
we saw in Chapter 3. Indeed, Sproat (1998b) and colleagues have worked out
the use of transducers for text-to-speech. We might break down their
transducer approach into five components:

1. an FST to represent the pronunciation of individual words and
morphemes in the lexicon

2. FSAs to represent the possible sequencing of morphemes

3. individual FSTs for each pronunciation rule (for example expressing
the pronunciation of -s in different contexts)

4.  heuristics  and  letter-to-sound  (LTS)  rules/transducers  used  to  model 
the  pronunciations  of  names  and  acronyms 

5.  default  letter-to-sound  rules/transducers  for  any  other  unknown  words 

We  will  limit  our  discussion  here  to  the  first  four  components;  those 
interested  in  letter-to-sound  rules  should  see  (Allen  et  al.,  1987).  These  first 
components  will  turn  out  to  be  simple  extensions  of  the  FST  components 
we  saw  in  Chapter  3 and  on  page  109.  The  first  is  the  representation  of  the 
lexical  base  form  of  each  word;  recall  that  ‘base’  form  means  the  uninflected 
form  of  the  word.  The  previous  base  forms  were  stored  in  orthographic 
representation;  we  will  need  to  augment  each  of  them  with  the  correct  lexical 
phonological  representation.  Figure  4.21  shows  the  original  and  the  updated 
lexical  entries: 

The second part of our FST system is the finite state machinery to
model morphology. We will give only one example: the nominal plural
suffix -s. Figure 4.22 shows the automaton for English plurals from
Chapter 3, updated to handle pronunciation as well. The only change was the
addition of the [s] pronunciation for the suffix, and ε pronunciations for
all the morphological features.

We can compose the inflection FSA in Figure 4.22 with a transducer
implementing the baseform lexicon in Figure 4.21 to produce an
inflectionally-enriched lexicon that has singular and plural nouns. The
resulting mini-lexicon is shown in Figure 4.23.

Orthographic Lexicon        Lexicon
Regular Nouns
  cat                       c|k a|æ t|t
  fox                       f|f o|ɑ x|ks
  dog                       d|d o|ɑ g|g
Irregular Singular Nouns
  goose                     g|g oo|u s|s e|ε
Irregular Plural Nouns
  g o:e o:e s e             g|g oo|u:ee|i s|s e|ε

Figure 4.21 FST-based lexicon, extending the lexicon in the table on page
74 in Chapter 3. Each symbol in the lexicon is now a pair of symbols separated
by ‘|’, one representing the ‘orthographic’ lexical entry and one the
‘phonological’ lexical entry. The irregular plural geese also pre-specifies
the contents of the intermediate tape ‘:ee|i’.


Figure 4.22 FST for the nominal singular and plural inflection. The
automaton adds the morphological features [+N], [+PL], and [+SG] at the
lexical level where relevant, and also adds the plural suffix s|z (at the
intermediate level). We will discuss below why we represent the pronunciation
of -s as z rather than s.


The  lexicon  shown  in  Figure  4.23  has  two  levels,  an  underlying  or 
‘lexical’  level  and  an  intermediate  level.  The  only  thing  that  remains  is  to  add 
transducers  which  apply  spelling  rules  and  pronunciation  rules  to  map  the 
intermediate  level  into  the  surface  level.  These  include  the  various  spelling 
rules  discussed  on  page  76  and  the  pronunciation  rules  starting  on  page  104. 

The lexicon and these phonological rules and the orthographic rules
from Chapter 3 can now be used to map between a lexical representation
(containing both orthographic and phonological strings) and a surface
representation (containing both orthographic and phonological strings). As we
saw in Chapter 3, this mapping can be run from surface to lexical form, or
from lexical to surface form; Figure 4.24 shows the architecture. Recall that
the lexicon FST maps between the ‘lexical’ level, with its stems and
morphological features, and an ‘intermediate’ level which represents a simple
concatenation of morphemes. Then a host of FSTs, each representing either
a single spelling rule constraint or a single phonological constraint, all run
in parallel so as to map between this intermediate level and the surface level.
Each level has both orthographic and phonological representations. For
text-to-speech applications in which the input is a lexical form (for example
for text generation, where the system knows the lexical identity of the word,
its part of speech, its inflection, etc.), the cascade of FSTs can map from
lexical form to surface pronunciation. For text-to-speech applications in
which the input is a surface spelling (for example for ‘reading text out loud’
applications), the cascade of FSTs can map from surface orthographic form to
surface pronunciation via the underlying lexical form.
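
To make the two directions of use concrete, here is a minimal Python sketch. It is a plain-Python stand-in for the transducer cascade, not an FST implementation. The dictionary layout, the function names, the phone sets, and the reduction of the Chapter 3 spelling rules to a single e-insertion test are all assumptions made for illustration. The plural suffix is treated as underlying z and realized according to the preceding phone.

```python
# Plain-Python stand-in for the lexicon-plus-rules cascade; not a real FST.
# Hypothetical mini-lexicon: orthographic stem -> phonological stem (IPA).
REGULAR_NOUNS = {
    "cat": ["k", "æ", "t"],
    "fox": ["f", "ɑ", "k", "s"],
    "dog": ["d", "ɑ", "g"],
}
# Irregular nouns list singular phones, plural spelling, and plural phones.
IRREGULAR_NOUNS = {
    "goose": (["g", "u", "s"], "geese", ["g", "i", "s"]),
}

SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}  # voiceless non-sibilants


def plural_phones(stem_phones):
    """Realize the underlying plural z according to the preceding phone."""
    last = stem_phones[-1]
    if last in SIBILANTS:
        return stem_phones + ["ɨ", "z"]  # e.g. foxes
    if last in VOICELESS:
        return stem_phones + ["s"]       # e.g. cats
    return stem_phones + ["z"]           # e.g. dogs


def plural_spelling(stem):
    """Simplified e-insertion spelling rule (assumed, after Chapter 3)."""
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"


def generate(stem, plural=False):
    """Lexical form -> (surface spelling, surface pronunciation)."""
    if stem in IRREGULAR_NOUNS:
        sg_phones, pl_spelling, pl_phones = IRREGULAR_NOUNS[stem]
        if plural:
            return pl_spelling, "".join(pl_phones)
        return stem, "".join(sg_phones)
    phones = REGULAR_NOUNS[stem]
    if plural:
        return plural_spelling(stem), "".join(plural_phones(phones))
    return stem, "".join(phones)


def read_aloud(spelling):
    """Surface spelling -> pronunciation, going via the lexical entry."""
    for stem in list(REGULAR_NOUNS) + list(IRREGULAR_NOUNS):
        for plural in (False, True):
            surface, pronunciation = generate(stem, plural)
            if surface == spelling:
                return pronunciation
    return None  # unknown word: a real system would fall back to LTS rules
```

Here generate plays the role of the lexical-to-surface direction and read_aloud the surface-to-surface direction via the lexical entry. A transducer would run the same relation in either direction without the exhaustive search used in this sketch.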

Figure 4.24 Mapping between the lexicon and surface form for orthography
and phonology simultaneously. The system can be used to map from a
lexical entry to its surface pronunciation or from surface orthography to
surface pronunciation via the lexical entry.

Finally let us say a few words about names and acronyms. Acronyms
can be spelled with or without periods (I.R.S. or IRS). Acronyms with
periods are usually pronounced by spelling them out ([aɪɑrɛs]). Acronyms
that usually appear without periods (AIDS, ANSI, ASCAP) may either be
spelled out or pronounced as a word; so AIDS is usually pronounced the
same as the third-person form of the verb aid. Liberman and Church (1992)
suggest keeping a small dictionary of the acronyms that are pronounced as
words, and spelling out the rest. Their method for dealing with names begins
with a dictionary of the pronunciations of 50,000 names, and then applies a
small number of affix-stripping rules (akin to the Porter Stemmer of
Chapter 3), rhyming heuristics, and letter-to-sound rules to increase the
coverage. Liberman and Church (1992) took the most frequent quarter million
words in the Donnelly list. They found that the 50,000 word dictionary covered
59% of these 250,000 name tokens. Adding stress-neutral suffixes like -s,
-ville, and -son (Walters = Walter + s, Abelson = Abel + son, Lucasville
= Lucas + ville) increased the coverage to 84%. Adding name-name
compounds (Abdulhussein, Baumgaertner) and rhyming heuristics increased the
coverage to 89%. (The rhyming heuristics used letter-to-sound rules for the
beginning of the word and then found a rhyming word to help pronounce the
end; so Plotsky was pronounced by using the LTS rule for Pl- and guessing
-otsky from Trotsky.) They then added a number of more complicated
morphological rules (prefixes like O’Brien), stress-changing suffixes
(Adamovich), suffix-exchanges (Bierstadt = Bierbaum - baum + stadt) and used
a system of letter-to-sound rules for the remainder. This system was not
implemented as an FST; Exercise 4.11 will address some of the issues in
turning such a set of rules into an FST. Readers interested in further details
about names, acronyms and other unknown words should consult sources such as
Liberman and Church (1992), Vitale (1991), and Allen et al. (1987).
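
The back-off order described for names (dictionary, then stress-neutral suffixes, then rhyming, then letter-to-sound rules) can be sketched as follows. This is not Liberman and Church's implementation. The dictionary entries, the ARPAbet-style pronunciations, the suffix pronunciations, and the two onset rules are hypothetical values chosen only so that the examples from the text run.

```python
# Sketch of a back-off cascade for name pronunciation. The entries and
# ARPAbet-style pronunciations below are illustrative assumptions.
NAME_DICT = {
    "walter": "W AO L T ER",
    "abel": "EY B AH L",
    "lucas": "L UW K AH S",
    "trotsky": "T R AA T S K IY",
}
# Stress-neutral suffixes with assumed pronunciations.
SUFFIXES = {"s": "Z", "son": "S AH N", "ville": "V IH L"}
# Tiny stand-in for letter-to-sound rules on word onsets.
ONSET_LTS = {"pl": "P L", "tr": "T R"}
VOWELS = "aeiou"


def split_onset(word):
    """Split a word into its initial consonant cluster and the rest."""
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[:i], word[i:]
    return word, ""


def pronounce_name(name):
    """Return (pronunciation, method) for a name, or (None, 'lts-needed')."""
    word = name.lower()
    # 1. Direct dictionary lookup.
    if word in NAME_DICT:
        return NAME_DICT[word], "dictionary"
    # 2. Strip a stress-neutral suffix and look up the stem.
    for suffix, suffix_pron in SUFFIXES.items():
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and stem in NAME_DICT:
            return NAME_DICT[stem] + " " + suffix_pron, "suffix"
    # 3. Rhyming heuristic: LTS for the onset, borrow the rest from a rhyme.
    onset, rest = split_onset(word)
    for known, pron in NAME_DICT.items():
        known_onset, known_rest = split_onset(known)
        if (rest and rest == known_rest
                and onset in ONSET_LTS and known_onset in ONSET_LTS):
            borrowed = pron[len(ONSET_LTS[known_onset]):].strip()
            return ONSET_LTS[onset] + " " + borrowed, "rhyme"
    # 4. Anything else would go to full letter-to-sound rules.
    return None, "lts-needed"
```

With these toy tables, Walters, Abelson, and Lucasville are resolved by suffix stripping and Plotsky by rhyming with Trotsky. The coverage figures quoted in the text come from the original study and cannot be reproduced with a sketch of this size.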


4.7  Prosody  in  TTS 

The orthography to phone transduction process just described produces the
main component for the input to the part of a TTS system which actually
generates the speech. Another important part of the input is a specification
of the prosody. The term prosody is generally used to refer to aspects of a
sentence’s pronunciation which aren’t described by the sequence of phones
derived from the lexicon. Prosody operates on longer linguistic units than
phones, and hence is sometimes called the study of suprasegmental
phenomena.

Phonological  Aspects  of  Prosody 

There are three main phonological aspects to prosody: prominence,
structure and tune.

As page 102 discussed, prominence is a broad term used to cover stress
and accent. Prominence is a property of syllables, and is often described in
a relative manner, by saying one syllable is more prominent than another.
Pronunciation lexicons mark lexical stress; for example table has its stress
on the first syllable, while machine has its stress on the second. Function
words like there, the or a are usually unaccented altogether. When words are
joined together, their accentual patterns combine and form a larger accent
pattern for the whole utterance. There are some regularities in how accents
combine. For example adjective-noun combinations like new truck are
likely to have accent on the right word (new *truck), while noun-noun
compounds like *tree surgeon are likely to have accent on the left. In general,
however, there are many exceptions to these rules, and so accent prediction
is quite complex. For example the noun-noun compound *apple cake has the
accent on the first word while the noun-noun compounds apple *pie and city
*hall both have the accent on the second word (Liberman and Sproat, 1992;
Sproat, 1994, 1998a). Furthermore, rhythm plays a role in keeping the
accented syllables spread apart a bit; thus city *hall and *parking lot combine
as *city hall *parking lot (Liberman and Prince, 1977). Finally, the location
of accent is very strongly affected by the discourse factors we will describe
in Chapter 18 and Chapter 19; in particular new or focused words or phrases
often receive accent.

Sentences have prosodic structure in the sense that some words seem to
group naturally together and some words seem to have a noticeable break or
disjuncture between them. Often prosodic structure is described in terms of
prosodic phrasing, meaning that an utterance has a prosodic phrase
structure in a similar way to it having a syntactic phrase structure. For
example, in the sentence I wanted to go to London, but could only get tickets
for France there seem to be two main prosodic phrases, their boundary
occurring at the comma. Commonly used terms for these larger prosodic units
include intonational phrase or IP (Beckman and Pierrehumbert, 1986),
intonation unit (Du Bois et al., 1983), and tone unit (Crystal, 1969).
Furthermore, in the first phrase, there seems to be another set of lesser
prosodic phrase boundaries (often called intermediate phrases) that split up
the words as follows: I wanted | to go | to London. The exact definitions of
prosodic phrases and subphrases and their relation to syntactic phrases like
clauses and noun phrases and semantic units have been and still are the topic
of much debate (Chomsky and Halle, 1968; Langendoen, 1975; Streeter, 1978;
Hirschberg and Pierrehumbert, 1986; Selkirk, 1986; Nespor and Vogel, 1986;
Croft, 1995; Ladd, 1996; Ford and Thompson, 1996; Ford et al., 1996). Despite
these complications, algorithms have been proposed which attempt to
automatically break an input text sentence into intonational phrases. For
example Wang and Hirschberg (1992), Ostendorf and Veilleux (1994), Taylor
and Black (1998), and others have built statistical models (incorporating
probabilistic predictors such as the CART-style decision trees to be defined
in Chapter 5) for predicting intonational phrase boundaries based on such
features as the parts of speech of the surrounding words, the length of the
utterance in words and seconds, the distance of the potential boundary from
the beginning or ending of the utterance, and whether the surrounding words
are accented.

Two utterances with the same prominence and phrasing patterns can
still differ prosodically by having different tunes. Tune refers to the
intonational melody of an utterance. Consider the utterance oh, really. Without
varying the phrasing or stress, it is still possible to have many variants of
this by varying the intonational tune. For example, we might have an
excited version oh, really! (in the context of a reply to a statement that
you’ve just won the lottery); a sceptical version oh, really? (in the context
of not being sure that the speaker is being honest); or an angry oh, really!
indicating displeasure. Intonational tunes can be broken into component parts,
the most important of which is the pitch accent. Pitch accents occur on
stressed syllables and form a characteristic pattern in the F0 contour (as
explained below). Depending on the type of pattern, different effects (such as
those just outlined) can be produced. A popular model of pitch accent
classification is the Pierrehumbert or ToBI model (Pierrehumbert, 1980;
Silverman et al., 1992), which says there are 5 pitch accents in English,
which are made from combining two simple tones (high H, and low L) in various
ways. An H+L pattern forms a fall, while an L+H pattern forms a rise. An
asterisk (*) is also used to indicate which tone falls on the stressed
syllable. This gives an inventory of H*, L*, L+H*, L*+H, H+L* (a sixth pitch
accent H*+L which was present in early versions of the model was later
abandoned). Our three examples of oh, really might be marked with the accents
L+H*, L*+H and L* respectively. In addition to pitch accents, this model also
has two phrase accents L- and H- and two boundary tones L% and H%, which are
used at the ends of phrases to control whether the intonational tune rises or
falls.

Other intonational models differ from ToBI by not using discrete
phonemic classes for intonation accents. For example the Tilt (Taylor, 2000)
and Fujisaki models (Fujisaki and Ohno, 1997) use continuous parameters rather
than discrete categories to model pitch accents. These researchers argue that
while the discrete models are often easier to visualize and work with,
continuous models may be more robust and more accurate for computational
purposes.

Phonetic  or  Acoustic  Aspects  of  Prosody 

The three phonological factors interact and are realized by a number of
different phonetic or acoustic phenomena. Prominent syllables are generally
louder and longer than non-prominent syllables. Prosodic phrase boundaries
are often accompanied by pauses, by lengthening of the syllable just before
the boundary, and sometimes lowering of pitch at the boundary. Intonational
tune is manifested in the fundamental frequency (F0) contour.

Prosody  in  Speech  Synthesis 

A major task for a TTS system is to generate appropriate linguistic
representations of prosody, and from them generate appropriate acoustic
patterns which will be manifested in the output speech waveform. The output of
a TTS system with such a prosodic component is a sequence of phones,
each of which has a duration and an F0 (pitch) value. The duration of each
phone is dependent on the phonetic context (see Chapter 7). The F0 value
is influenced by the factors discussed above, including the lexical stress, the
accented or focused element in the sentence, and the intonational tune of the
utterance (for example a final rise for questions). Figure 4.25 shows some
sample TTS output from the FESTIVAL (Black et al., 1999) speech synthesis
system for the sentence Do you really want to see all of it?. This output,
together with the F0 values shown in Figure 4.26, would be the input to the
waveform synthesis component described in Chapter 7. The durations here
are computed by a CART-style decision tree (Riley, 1992).

Figure 4.25 Output of the FESTIVAL (Black et al., 1999) generator for the
sentence Do you really want to see all of it?. The exact intonation contour is
shown in Figure 4.26.


As was suggested above, determining the proper prosodic pattern for
a sentence is difficult, as real-world knowledge and semantic information are
needed to know which syllables to accent, and which tune to apply. This sort
of information is difficult to extract from the text and hence prosody modules
often aim to produce a “neutral declarative” version of the input text, which
assumes the sentence should be spoken in a default way with no reference to
discourse history or real-world events. This is one of the main reasons why
intonation in TTS often sounds “wooden”.

4.8  Human  Processing  of  Phonology  and  Morphology 

Chapter  3 suggested  that  productive  morphology  plays  a psychologically  real 
role  in  the  human  lexicon.  But  we  stopped  short  of  a detailed  model  of  how 
the morphology might be represented. Now that we have studied
phonological structure and phonological learning, we return to the psychological
question  of  the  representation  of  morphological/phonological  knowledge. 

One view of human morphological or phonological processing might
be that it distinguishes productive, regular morphology from irregular or
exceptional morphology. Under this view, the regular past tense morpheme
-ed, for example, could be mentally represented as a rule which would be
applied to verbs like walk to produce walked. Irregular past tense verbs like
broke, sang, and brought, on the other hand, would simply be stored as part
of a lexical representation, and the rule wouldn’t apply to these. Thus this
proposal strongly distinguishes representation via rules from representation
via lexical listing.
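
The division of labor in this proposal can be written down in a few lines. The Python sketch below is an illustration of the rule-plus-listing view only. It is not a psychological model, it covers spelling rather than pronunciation, and the function name and the small irregular table are assumptions made here.

```python
# Sketch of the "rule plus lexical listing" view of the past tense.
# Irregular forms are stored whole (illustrative subset only).
IRREGULAR_PAST = {"break": "broke", "sing": "sang", "bring": "brought"}


def past_tense(verb):
    """Return the past tense spelling of a verb under the two-route view."""
    # Lexical listing takes priority and blocks the regular rule.
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    # Regular rule: add -ed (just -d after a final e). Other spelling
    # rules, such as consonant doubling, are ignored in this sketch.
    return verb + ("d" if verb.endswith("e") else "ed")
```

Note that an irregular verb missing from the table simply falls through to the rule. Nothing in such a model lets the stored irregular forms influence one another or extend to new verbs.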

This proposal seems sensible, and is indeed identical to the
transducer-based models we have presented in these last two chapters.
Unfortunately, this simple model seems to be wrong. One problem is that the
irregular verbs themselves show a good deal of phonological subregularity. For
example, the ɪ/æ alternation relating ring and rang also relates sing and sang
and swim and swam (Bybee and Slobin, 1982). Children learning the language
often extend this pattern to incorrectly produce bring-brang, and adults often
make speech errors showing effects of this subregular pattern. A second
problem is that there is psychological evidence that high-frequency regular
inflected forms (needed, covered) are stored in the lexicon just like the stems
cover and need (Losiewicz, 1992). Finally, word and morpheme frequency
in general seems to play an important role in human processing.

Arguments like these led to ‘data-driven’ models of morphological
learning and representation, which essentially store all the inflected forms
they have seen. These models generalize to new forms by a kind of analogy;
regular morphology is just like subregular morphology but acquires rule-like
trappings simply because it occurs more often. Such models include the
computational connectionist or Parallel Distributed Processing model of
Rumelhart and McClelland (1986) and subsequent improvements (Plunkett
and Marchman, 1991; MacWhinney and Leinbach, 1991) and the similar
network model of Bybee (1985, 1995). In these models, the behavior of
regular morphemes like -ed emerges from its frequent interaction with other
forms. Proponents of the rule-based view of morphology such as Pinker
and Prince (1988), Marcus et al. (1995), and others, have criticized the
connectionist models and proposed a compromise dual processing model, in
which regular forms like -ed are represented as symbolic rules, but subregular
examples (broke, brought) are represented by connectionist-style pattern
associators. This debate between the connectionist and dual processing models
has deep implications for mental representation of all kinds of regular
rule-based behavior and is one of the most interesting open questions in human
language processing. Chapter 7 will briefly discuss connectionist models of
human speech processing; readers who are further interested in
connectionist models should consult the references above and textbooks like
Anderson (1995).

4.9  Summary 

This chapter has introduced many of the important notions we need to
understand spoken language processing. The main points are as follows:

• We can represent the pronunciation of words in terms of units called
phones. The standard system for representing phones is the
International Phonetic Alphabet or IPA. An alternative English-only
transcription system that uses ASCII letters is the ARPAbet.

• Phones can be described by how they are produced articulatorily by
the vocal organs; consonants are defined in terms of their place and
manner of articulation and voicing, vowels by their height and
backness.

• A phoneme  is  a generalization  or  abstraction  over  different  phonetic 
realizations.  Allophonic  rules  express  how  a phoneme  is  realized  in  a 
given  context. 

• Transducers  can  be  used  to  model  phonological  rules  just  as  they  were 
used  in  Chapter  3 to  model  spelling  rules.  Two-level  morphology  is 
a theory  of  morphology/phonology  which  models  phonological  rules 
as  finite-state  well-formedness  constraints  on  the  mapping  between 
lexical and surface form.

• Pronunciation dictionaries are used for both text-to-speech and
automatic speech recognition. They give the pronunciation of words as
strings of phones, sometimes including syllabification and stress. Most
on-line pronunciation dictionaries have on the order of 100,000 words
but still lack many names, acronyms, and inflected forms.

• The  text-analysis  component  of  a text-to-speech  system  maps  from 
orthography  to  strings  of  phones.  This  is  usually  done  with  a large 
dictionary  augmented  with  a system  (such  as  a transducer)  for  handling 
productive  morphology,  pronunciation  changes,  names,  numbers,  and 
acronyms. 


Bibliographical  and  Historical  Notes 

The major insights of articulatory phonetics date to the linguists of 800-150
B.C. India. They invented the concepts of place and manner of articulation,
worked out the glottal mechanism of voicing, and understood the concept of
assimilation. European science did not catch up with the Indian phoneticians
until over 2000 years later, in the late 19th century. The Greeks did have
some rudimentary phonetic knowledge; by the time of Plato’s Theaetetus and
Cratylus, for example, they distinguished vowels from consonants, and stop
consonants from continuants. The Stoics developed the idea of the syllable and
were aware of phonotactic constraints on possible words. An unknown
Icelandic scholar of the twelfth century exploited the concept of the phoneme
and proposed a phonemic writing system for Icelandic, including diacritics for
length and nasality. But his text remained unpublished until 1818 and even
then was largely unknown outside Scandinavia (Robins, 1967). The modern
era of phonetics is usually said to have begun with Sweet, who proposed
what is essentially the phoneme in his Handbook of Phonetics (1877). He
also devised an alphabet for transcription and distinguished between broad
and narrow transcription, proposing many ideas that were eventually
incorporated into the IPA. Sweet was considered the best practicing phonetician
of his time; he made the first scientific recordings of languages for phonetic
purposes, and advanced the state of the art of articulatory description. He
was also infamously difficult to get along with, a trait that is well captured
in the stage character that George Bernard Shaw modeled after him: Henry
Higgins. The phoneme was first named by the Polish scholar Baudouin de
Courtenay, who published his theories in 1894.

The idea that phonological rules could be modeled as regular
relations dates to Johnson (1972), who showed that any phonological system
that didn’t allow rules to apply to their own output (i.e. systems that did not
have recursive rules) could be modeled with regular relations (or finite-state
transducers). Virtually all phonological rules that had been formulated at
the time had this property (except some rules with integral-valued features,
like early stress and tone rules). Johnson’s insight unfortunately did not
attract the attention of the community, and was independently discovered by
Ronald Kaplan and Martin Kay; see Chapter 3 for the rest of the history of
two-level morphology. Karttunen (1993) gives a tutorial introduction to
two-level morphology which includes more of the advanced details than we were
able to present here.

Readers interested in phonology should consult Goldsmith (1995) as a
reference on phonological theory in general and Archangeli and Langendoen
(1997) on Optimality Theory.

Two classic text-to-speech synthesis systems are described in Allen
et al. (1987) (the MITalk system) and Sproat (1998b) (the Bell Labs
system). The pronunciation problem in text-to-speech synthesis is an ongoing
research area; much of the current research focuses on prosody. Interested
readers should consult the proceedings of the main speech engineering
conferences: ICSLP (the International Conference on Spoken Language
Processing), IEEE ICASSP (the International Conference on Acoustics, Speech,
and Signal Processing), and EUROSPEECH.

Students with further interest in transcription and articulatory
phonetics should consult an introductory phonetics textbook such as Ladefoged
(1993). Pullum and Ladusaw (1996) is a comprehensive guide to each of the
symbols and diacritics of the IPA. Many phonetics papers of computational
interest are to be found in the Journal of the Acoustical Society of America
(JASA), Computer Speech and Language, and Speech Communication.


Exercises 

4.1 Find the mistakes in the IPA transcriptions of the following words:

a. “three” [ðri]

b. “sing” [sɪng]

c. “eyes” [aɪs]

d. “study” [stʊdi]

e. “though” [θoʊ]

f. “planning” [plɑnɪŋ]

g. “slight” [slit]

4.2 Translate the pronunciations of the following color words from the IPA
into the ARPAbet (and make a note if you think you pronounce them
differently than this!)

a. [rɛd]

b. [blu]

c. [grin]

d. [jɛloʊ]

e. [blæk]

f. [waɪt]

g. [ɔrɪndʒ]

h. [pɝpl̩]

i. [pjus]

j. [toʊp]

4.3  Transcribe  Ira  Gershwin’s  two  pronunciations  of  ‘either’  in  IPA  and  in 
the  ARPAbet. 

4.4  Transcribe  the  following  words  in  both  the  ARPAbet  and  the  IPA. 

a.  dark 

b.  suit 

c.  greasy 

d.  wash 

e.  water 

4.5  Write  an  FST  which  correctly  pronounces  strings  of  dollar  amounts 
like  $45,  $320,  and  $4100.  If  there  are  multiple  ways  to  pronounce  a number 
you  may  pick  your  favorite  way. 

4.6  Write  an  FST  which  correctly  pronounces  7-digit  phone  numbers  like 
555-1212,  555-1300,  and  so  on.  You  should  use  a combination  of  the  paired 
and  trailing  unit  methods  of  pronunciation  for  the  last  four  digits. 

4.7  Build  an  automaton  for  rule  (4.5). 


4.8  One difference between one dialect of Canadian English and most
dialects of American English is called Canadian raising. Bromberger and
Halle (1989) note that some Canadian dialects of English raise /aɪ/ to [ʌɪ]
and /aʊ/ to [ʌʊ] before a voiceless consonant. A simplified version of the
rule dealing only with /aɪ/ can be stated as:

$$\text{/aɪ/} \rightarrow \text{[ʌɪ]} \;/\; \_\_\_ \begin{bmatrix} \mathrm{C} \\ -\mathrm{voice} \end{bmatrix} \qquad (4.12)$$

This rule has an interesting interaction with the flapping rule. In some
Canadian dialects the words rider and writer are pronounced differently: rider
is pronounced [raɪɾɚ] while writer is pronounced [rʌɪɾɚ]. Write a two-level
rule and an automaton for both the raising rule and the flapping rule which
correctly models this distinction. You may make simplifying assumptions as
needed.


4.9  Write the lexical entry for the pronunciation of the English past tense
(preterite) suffix -d, and the two-level rules that express the difference in its
pronunciation depending on the previous context. Don’t worry about the
spelling rules. (Hint: make sure you correctly handle the pronunciation of
the past tenses of the words add, pat, bake, and bag.)

4.10  Write two-level rules for the Yawelmani Yokuts phenomena of
Harmony, Shortening, and Lowering introduced on page 110. Make sure your
rules are capable of running in parallel.

4.11  Find 10 stress-neutral name suffixes (look in a phone book) and sketch
an FST which would model the pronunciation of names with or without
suffixes.


PROBABILISTIC  MODELS 
OF  PRONUNCIATION 
AND  SPELLING 


ALGERNON: But my own sweet Cecily, I have never written you
any letters.

CECILY: You need hardly remind me of that, Ernest. I remember
only too well that I was forced to write your letters for you. I
wrote always three times a week, and sometimes oftener.

ALGERNON: Oh, do let me read them, Cecily?

CECILY: Oh, I couldn’t possibly. They would make you far too
conceited. The three you wrote me after I had broken off the
engagement are so beautiful, and so badly spelled, that even now I
can hardly read them without crying a little.

Oscar Wilde, The Importance of Being Earnest


Like Oscar Wilde’s Cecily, the characters in Gilbert and Sullivan’s
operettas also seem somewhat anxious about spelling. The Gondoliers’
Giuseppe worries that his private secretary is ‘shaky in his spelling’ while
Iolanthe’s Phyllis can ‘spell every word that she uses’. While an investigation into
the role of proper spelling in class identification at the turn of the century
would take us too far afield (although see Veblen (1899)), we can certainly
agree that many more of us are like Cecily than like Phyllis. Estimates for
the frequency of spelling errors in human typed text vary from 0.05% of the
words in carefully edited newswire text to 38% in difficult applications like
telephone directory lookup (Kukich, 1992).

In this chapter we discuss the problem of detecting and correcting
spelling errors and the closely related problem of modeling pronunciation
variation for automatic speech recognition and text-to-speech systems. On the
surface, the problems of finding spelling errors in text and modeling the
variable pronunciation of words in spoken language don’t seem to have much
in common. But the problems turn out to be isomorphic in an important
way: they can both be viewed as problems of probabilistic transduction. For
speech recognition, given a string of symbols representing the pronunciation
of a word in context, we need to figure out the string of symbols representing
the lexical or dictionary pronunciation, so we can look the word up in the
dictionary. But any given surface pronunciation is ambiguous; it might
correspond to different possible words. For example the ARPAbet pronunciation
[er] could correspond to reduced forms of the words her, were, are, their,
or your. This ambiguity problem is heightened by pronunciation
variation; for example the word the is sometimes pronounced THEE and
sometimes THUH; the word because sometimes appears as because, sometimes
as ’cause. Some aspects of this variation are systematic; Section 5.7 will
survey the kinds of pronunciation variation that are important for
speech recognition and text-to-speech, and present some preliminary rules
describing this variation. High-quality speech synthesis algorithms need to
know when to use particular pronunciation variants. Solving both speech
tasks requires extending the transduction between surface phones and
lexical phones discussed in Chapter 4 with probabilistic variation.

Similarly, given the sequence of letters corresponding to a mis-spelled
word, we need to produce an ordered list of possible correct words. For
example the sequence acress might be a mis-spelling of actress, or of cress,
or of acres. We transduce from the ‘surface’ form acress to the various
possible ‘lexical’ forms, assigning each a probability; we then select
the most probable correct word.

In this chapter we first introduce the problems of detecting and
correcting spelling errors, and also summarize typical human spelling error patterns.
We then introduce the essential probabilistic architecture that we will use to
solve both spelling and pronunciation problems: Bayes’ rule and the
noisy channel model. Bayes’ rule and its application to the noisy
channel model will play a role in many problems throughout the book,
particularly in speech recognition (Chapter 7), part-of-speech tagging (Chapter 8),
and probabilistic parsing (Chapter 12).

Bayes’ rule and the noisy channel model provide the probabilistic
framework for these problems. But actually solving them requires an
algorithm. This chapter introduces an essential algorithm called the dynamic
programming algorithm, and various instantiations including the Viterbi
algorithm, the minimum edit distance algorithm, and the forward
algorithm. We will also see the use of a probabilistic version of the finite-state
automaton called the weighted automaton.


5.1  Dealing  with  Spelling  Errors 

The detection and correction of spelling errors is an integral part of modern
word-processors. The very same algorithms are also important in
applications in which even the individual letters aren’t guaranteed to be accurately
identified: optical character recognition (OCR) and on-line handwriting
recognition. Optical character recognition is the term used for automatic
recognition of machine or hand-printed characters. An optical scanner
converts a machine or hand-printed page into a bitmap which is then passed to
an OCR algorithm.

On-line handwriting recognition is the recognition of human printed
or cursive handwriting as the user is writing. Unlike OCR analysis of
handwriting, algorithms for on-line handwriting recognition can take advantage
of dynamic information about the input such as the number and order of
the strokes, and the speed and direction of each stroke. On-line handwriting
recognition is important where keyboards are inappropriate, such as in
small computing environments (Palm Pilot applications, etc.) or in scripts
like Chinese that have large numbers of written symbols, making keyboards
cumbersome.

In this chapter we will focus on detection and correction of spelling
errors, mainly in typed text, but the algorithms will apply also to OCR and
handwriting applications. OCR systems have even higher error rates than
human typists, although they tend to make different errors than typists. For
example OCR systems often misread ‘D’ as ‘O’ or ‘ri’ as ‘n’, producing
‘mis-spelled’ words like dension for derision, or POQ Bach for PDQ Bach.
The reader with further interest in handwriting recognition should consult
sources such as Tappert et al. (1990), Hu et al. (1996), and Casey and
Lecolinet (1996).

Kukich  (1992),  in  her  survey  article  on  spelling  correction,  breaks  the 
field  down  into  three  increasingly  broader  problems: 

1.  non-word  error  detection:  detecting  spelling  errors  which  result  in 
non-words  (like  graffe  for  giraffe). 

2.  isolated-word error correction: correcting spelling errors which
result in non-words, for example correcting graffe to giraffe, but looking
only at the word in isolation.

3.  context-dependent error detection and correction: using the
context to help detect and correct spelling errors even if they
accidentally result in an actual word of English (real-word errors). This
can happen from typographical errors (insertion, deletion,
transposition) which accidentally produce a real word (e.g. there for three), or
because the writer substituted the wrong spelling of a homophone or
near-homophone (e.g. dessert for desert, or piece for peace).

The next section will discuss the kinds of spelling-error patterns that
occur in typed text and OCR and handwriting-recognition input.


5.2  Spelling  Error  Patterns 

The number and nature of spelling errors in human typed text differ from
those caused by pattern-recognition devices like OCR and handwriting
recognizers. Grudin (1983) found spelling error rates of between 1% and 3%
in human typewritten text (this includes both non-word errors and real-word
errors). This error rate goes down significantly for copy-edited text. The
rate of spelling errors in handwritten text itself is similar; word error rates of
between 1.5% and 2.5% have been reported (Kukich, 1992).

The errors of OCR and on-line handwriting systems vary. Yaeger et al.
(1998) propose, based on studies that they warn are inconclusive, that the
online printed character recognition on Apple Computer’s NEWTON
MESSAGEPAD has a word accuracy rate of 97%-98%, i.e. an error rate of 2%-3%,
but with a high variance (depending on the training of the writer, etc.). OCR
error rates also vary widely depending on the quality of the input; Lopresti
and Zhou (1997) suggest that OCR letter-error rates typically range from
0.2% for clean, first-generation copy to 20% or worse for multigeneration
photocopies and faxes.

In an early study, Damerau (1964) found that 80% of all misspelled
words (non-word errors) in a sample of human keypunched text were caused
by single-error misspellings: a single one of the following errors:1

• insertion: mistyping the as ther

• deletion: mistyping the as th

• substitution: mistyping the as thw

• transposition: mistyping the as hte

1 In another corpus, Peterson (1986) found that single-error misspellings accounted for an
even higher percentage of all misspelled words (93%-95%). The difference between the 80%
and the higher figure may be due to the fact that Damerau’s text included errors caused in
transcription to punched card forms, errors in keypunching, and errors caused by paper tape
equipment (!) in addition to purely human misspellings.

Because of this study, much subsequent research has focused on the
correction of single-error misspellings. Indeed, the first algorithm we will
present later in this chapter relies on the large proportion of single-error
misspellings.

Kukich (1992) breaks down human typing errors into two classes.
Typographic errors (for example misspelling spell as speel) are generally
related to the keyboard. Cognitive errors (for example misspelling
separate as seperate) are caused by writers who do not know how to spell the
word. Grudin (1983) found that the keyboard was the strongest influence on
the errors produced; typographic errors constituted the majority of all error
types. For example consider substitution errors, which were the most
common error type for novice typists, and the second most common error type
for expert typists. Grudin found that immediately adjacent keys in the same
row accounted for 59% of the novice substitutions and 31% of the expert
substitutions (e.g. smsll for small). Adding in errors in the same column and
homologous errors (hitting the corresponding key on the opposite side of
the keyboard with the other hand), a total of 83% of the novice substitutions
and 51% of the expert substitutions could be considered keyboard-based
errors. Cognitive errors included phonetic errors (substituting a phonetically
equivalent sequence of letters, e.g. seperate for separate) and homonym errors
(substituting piece for peace). Homonym errors will be discussed in
Chapter 6 when we discuss real-word error correction.

While typing errors are usually characterized as substitutions,
insertions, deletions, or transpositions, OCR errors are usually grouped into five
classes: substitutions, multisubstitutions, space deletions, space insertions, and
failures. Lopresti and Zhou (1997) give the following example of common
OCR errors:

Correct:     The quick brown fox jumps over the lazy dog.

Recognized:  'lhe q~ick brown foxjurnps over tb l azy dog.

Substitutions (e → c) are generally caused by visual similarity (rather
than keyboard distance), as are multisubstitutions (T → 'l, m → rn, he →
b). Multisubstitutions are also often called framing errors. Failures
(represented by the tilde character ‘~’: u → ~) are cases where the OCR algorithm
does not select any letter with sufficient accuracy.
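
The four single-error types listed above can be told apart mechanically. The following is a minimal Python sketch, not taken from any of the studies cited here; the function name classify_single_error is an illustrative assumption. Given a correct word and a typed word, it reports which single error (if any) relates them.

```python
from typing import Optional


def classify_single_error(correct: str, typed: str) -> Optional[str]:
    """Name the single error that turns `correct` into `typed`, or None.

    Hypothetical helper for illustration only.
    """
    if correct == typed:
        return None
    n, m = len(correct), len(typed)
    # i is the length of the longest common prefix of the two words.
    i = 0
    while i < min(n, m) and correct[i] == typed[i]:
        i += 1
    if m == n + 1 and correct[i:] == typed[i + 1:]:
        return "insertion"      # one extra letter was typed
    if m == n - 1 and correct[i + 1:] == typed[i:]:
        return "deletion"       # one letter was left out
    if m == n:
        if correct[i + 1:] == typed[i + 1:]:
            return "substitution"   # one letter was replaced
        if (i + 1 < n
                and correct[i] == typed[i + 1]
                and correct[i + 1] == typed[i]
                and correct[i + 2:] == typed[i + 2:]):
            return "transposition"  # two adjacent letters were swapped
    return None  # more than one error apart


for typo in ["ther", "th", "thw", "hte"]:
    print(typo, classify_single_error("the", typo))
```

A pair that is more than one error apart returns None, which is exactly the case that single-error correction methods give up on.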


5.3  Detecting  Non-Word  Errors 

Detecting non-word errors in text, whether typed by humans or scanned, is
most commonly done by the use of a dictionary. For example, the word
foxjurnps in the OCR example above would not occur in a dictionary. Some
early research (Peterson, 1986) had suggested that such spelling
dictionaries would need to be kept small, because large dictionaries contain very rare
words that resemble misspellings of other words. For example wont is a
legitimate but rare word but is a common misspelling of won’t. Similarly,
veery (a kind of thrush) might also be a misspelling of very. Based on a
simple model of single-error misspellings, Peterson showed that
10% of such misspellings might be ‘hidden’ by real words in a 50,000
word dictionary, but that 15% of single-error misspellings might be ‘hidden’
in a 350,000 word dictionary. In practice, Damerau and Mays (1989) found
that this was not a serious problem; while some misspellings were hidden by real
words in a larger dictionary, the larger dictionary proved more
help than harm.
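
Dictionary lookup and the ‘hidden’ misspelling effect can both be seen in a few lines. This is a minimal Python sketch under stated assumptions: the word sets are toy stand-ins for a real dictionary, and find_non_words is an illustrative name rather than an existing tool.

```python
import re

# Toy dictionary (assumption): a real detector would load a large word list.
SMALL_DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over",
                    "lazy", "dog", "i", "go", "won't"}


def find_non_words(text, dictionary=SMALL_DICTIONARY):
    """Return the tokens of `text` that do not occur in `dictionary`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in dictionary]


# A run-together word is flagged because it is not in the dictionary.
print(find_non_words("The quick brown foxjumps over the lazy dog."))

# A larger dictionary that lists the rare word "wont" hides the misspelling.
LARGER_DICTIONARY = SMALL_DICTIONARY | {"wont"}
print(find_non_words("I wont go"))                      # flags "wont"
print(find_non_words("I wont go", LARGER_DICTIONARY))   # flags nothing
```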

Because of the need to represent productive inflection (the -s and -ed
suffixes) and derivation, dictionaries for spelling error detection usually
include models of morphology, just as the dictionaries for text-to-speech we
saw in Chapter 3 and Chapter 4. Early spelling error detectors simply
allowed any word to have any suffix; thus Unix SPELL accepts bizarre
prefixed words like misclam and antiundoggingly and suffixed words based on
the, like thehood and theness. Modern spelling error detectors use more
linguistically-motivated morphological representations (see Chapter 3).


5.4  Probabilistic  Models 

This section introduces probabilistic models of pronunciation and spelling
variation. These models, particularly the Bayesian inference or noisy
channel model, will be applied throughout this book to many different problems.

We claimed earlier that the problem of ASR pronunciation modeling,
and the problem of spelling correction for typing or for OCR, can be modeled
as problems of mapping from one string of symbols to another. For speech
recognition, given a string of symbols representing the pronunciation of a
word in context, we need to figure out the string of symbols representing
the lexical or dictionary pronunciation, so we can look the word up in the
dictionary. Similarly, given the incorrect sequence of letters in a mis-spelled
word, we need to figure out the correct sequence of letters in the
correctly-spelled word.


The intuition of the noisy channel model (see Figure 5.1) is to treat
the surface form (the ‘reduced’ pronunciation or misspelled word) as an
instance of the lexical form (the ‘lexical’ pronunciation or correctly-spelled
word) which has been passed through a noisy communication channel. This
channel introduces ‘noise’ which makes it hard to recognize the ‘true’ word.
Our goal is then to build a model of the channel so that we can figure out how
it modified this ‘true’ word and hence recover it. For the complete speech
recognition task, there are many sources of ‘noise’: variation in
pronunciation, variation in the realization of phones, and acoustic variation due to the
channel (microphones, telephone networks, etc.). Since this chapter focuses
on pronunciation, what we mean by ‘noise’ here is the variation in
pronunciation that masks the lexical or ‘canonical’ pronunciation; the other sources
of noise in a speech recognition system will be discussed in Chapter 7. For
spelling error detection, what we mean by noise is the spelling errors which
mask the correct spelling of the word. The metaphor of the noisy channel
comes from the application of the model to speech recognition in the IBM
labs in the 1970s (Jelinek, 1976). But the algorithm itself is a special case of
Bayesian inference and as such has been known since the work of Bayes
(1763). Bayesian inference or Bayesian classification was applied
successfully to language problems as early as the late 1950s, including the OCR
work of Bledsoe in 1959, and the seminal work of Mosteller and Wallace
(1964) on applying Bayesian inference to determine the authorship of the
Federalist papers.

In Bayesian classification, as in any classification task, we are given
some observation and our job is to determine which of a set of classes it
belongs to. For speech recognition, imagine for the moment that the
observation is the string of phones which make up a word as we hear it. For
spelling error detection, the observation might be the string of letters that
constitute a possibly-misspelled word. In both cases, we want to classify
the observations into words; thus in the speech case, no matter which of the
many possible ways the word about is pronounced (see Chapter 4) we want
to classify it as about. In the spelling case, no matter how the word separate
is misspelled, we’d like to recognize it as separate.

Let’s begin with the pronunciation example. We are given a string of
phones (say [ni]). We want to know which word corresponds to this string
of phones. The Bayesian interpretation of this task starts by considering all
possible classes, in this case all possible words. Out of this universe of
words, we want to choose the word which is most probable given the
observation we have ([ni]). In other words, we want, out of all words in the
vocabulary V, the single word such that P(word|observation) is highest. We
use ŵ to mean ‘our estimate of the correct w’, and we’ll use O to mean ‘the
observation sequence [ni]’ (we call it a sequence because we think of each
letter as an individual observation). Then the equation for picking the best
word given O is:


$$\hat{w} = \operatorname*{argmax}_{w \in V} P(w \mid O) \qquad (5.1)$$


The function $\operatorname*{argmax}_x f(x)$ means ‘the x such that f(x) is maximized’.
While (5.1) is guaranteed to give us the optimal word w, it is not clear how
to make the equation operational; that is, for a given word w and observation
sequence O we don’t know how to directly compute P(w|O). The intuition of
Bayesian classification is to use Bayes’ rule to transform (5.1) into a product
of two probabilities, each of which turns out to be easier to compute than
P(w|O). Bayes’ rule is presented in (5.2); it gives us a way to break down
P(x|y) into three other probabilities:

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)} \qquad (5.2)$$

We can see this by substituting (5.2) into (5.1) to get (5.3):

$$\hat{w} = \operatorname*{argmax}_{w \in V} \frac{P(O \mid w)\,P(w)}{P(O)} \qquad (5.3)$$
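
As a brief reminder of where (5.2) comes from, the rule follows from writing the joint probability of x and y in two ways. This is the standard textbook derivation, added here only as a worked step and assuming P(y) > 0:

```latex
\begin{align*}
P(x, y) &= P(x \mid y)\,P(y) = P(y \mid x)\,P(x) \\
\Longrightarrow\quad P(x \mid y) &= \frac{P(y \mid x)\,P(x)}{P(y)}, \qquad P(y) > 0 .
\end{align*}
```

Setting x = w and y = O gives exactly the fraction that appears inside the argmax of (5.3).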


The probabilities on the right hand side of (5.3) are for the most part
easier to compute than the probability P(w|O) which we were originally
trying to maximize in (5.1). For example, P(w), the probability of the word
itself, we can estimate by the frequency of the word. And we will see below
that P(O|w) turns out to be easy to estimate as well. But P(O), the
probability of the observation sequence, turns out to be harder to estimate. Luckily,
we can ignore P(O). Why? Since we are maximizing over all words, we will
be computing P(O|w)P(w)/P(O) for each word. But P(O) doesn’t change for each
word; we are always asking about the most likely word for the same
observation O, which must have the same probability P(O). Thus:


$$\hat{w} = \operatorname*{argmax}_{w \in V} \frac{P(O \mid w)\,P(w)}{P(O)} = \operatorname*{argmax}_{w \in V} P(O \mid w)\,P(w) \qquad (5.4)$$


To summarize, the most probable word w given some observation O
can be computed by taking the product of two probabilities for each word,
and choosing the word for which this product is greatest. These two terms
have names; P(w) is called the prior probability, and P(O|w) is called the
likelihood.



Key Concept #3.

$$\hat{w} = \operatorname*{argmax}_{w \in V} \overbrace{P(O \mid w)}^{\text{likelihood}}\;\overbrace{P(w)}^{\text{prior}} \qquad (5.5)$$
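
The decision rule in (5.5) is short enough to state directly in code. The following Python sketch uses made-up probabilities for four words that might be heard as [ni]; the numbers, the word list, and the name most_probable_word are illustrative assumptions, not estimates from any corpus. It also checks that dividing by P(O) leaves the winner unchanged.

```python
def most_probable_word(likelihood, prior):
    """Return the word w that maximizes P(O|w) * P(w), as in (5.5)."""
    return max(prior, key=lambda w: likelihood.get(w, 0.0) * prior[w])


# Hypothetical values for the observation O = [ni] (assumptions only).
prior = {"new": 0.001, "neat": 0.0002, "need": 0.0005, "knee": 0.00002}
likelihood = {"new": 0.3, "neat": 0.5, "need": 0.1, "knee": 1.0}

best = most_probable_word(likelihood, prior)

# P(O), computed here under the assumption that only these four words can
# produce O. It is the same for every word, so dividing by it rescales the
# scores without changing which word wins.
p_obs = sum(likelihood[w] * prior[w] for w in prior)
posterior = {w: likelihood[w] * prior[w] / p_obs for w in prior}
best_posterior = max(posterior, key=posterior.get)

print(best, best_posterior)
```

Note that the word with the highest likelihood (knee) does not win, because its prior is so low; the product is what matters.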


In the next sections we will show how to compute these two
probabilities for the problems of pronunciation and spelling.


5.5  Applying  the  Bayesian  method  to  spelling 


There are many algorithms for spelling correction; we will focus on the
Bayesian (or noisy channel) algorithm because of its generality. Chapter 6
will show how this algorithm can be extended to model real-word spelling
errors; this section will focus on non-word spelling errors. The noisy
channel approach to spelling correction was first suggested by Kernighan et al.
(1990); their program, correct, takes words rejected by the Unix spell
program, generates a list of potential correct words, ranks them according to
Equation (5.5), and picks the highest-ranked one.

Let’s walk through the algorithm as it applies to Kernighan et al.’s
(1990) example misspelling acress. The algorithm has two stages: proposing
candidate corrections and scoring the candidates.

In order to propose candidate corrections Kernighan et al. make the
simplifying assumption that the correct word will differ from the misspelling
by a single insertion, deletion, substitution, or transposition. As Damerau’s
(1964) results show, even though this assumption causes the algorithm to
miss some corrections, it should handle most spelling errors in human typed
text. The list of candidate words is generated from the typo by applying any
single transformation which results in a word in a large on-line dictionary.
Applying all possible transformations to acress yields the list of candidate
words in Figure 5.2.


                               Transformation
                     Correct  Error   Position
Error   Correction   Letter   Letter  (Letter #)  Type
acress  actress      t        -       2           deletion
acress  cress        -        a       0           insertion
acress  caress       ca       ac      0           transposition
acress  access       c        r       2           substitution
acress  across       o        e       3           substitution
acress  acres        -        s       5           insertion
acress  acres        -        s       4           insertion

Figure 5.2  Candidate corrections for the misspelling acress, together with
the transformations that would have produced the error, after Kernighan et al.
(1990). ‘-’ represents a null letter.
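
The candidate-proposal stage can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the correct program: the tiny dictionary and the names single_edits and candidates are hypothetical, and a real system would use a large on-line word list.

```python
import string


def single_edits(typo):
    """All strings one deletion, transposition, substitution, or insertion
    away from `typo`.

    These edits are applied to the typo to recover a candidate word, so each
    one is the inverse of the typist's error: inserting a letter here undoes
    a deletion error, and so on.
    """
    letters = string.ascii_lowercase
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + ch + b[1:] for a, b in splits if b for ch in letters]
    inserts = [a + ch + b for a, b in splits for ch in letters]
    return set(deletes + transposes + substitutes + inserts) - {typo}


def candidates(typo, dictionary):
    """Keep only the single-edit strings that are dictionary words."""
    return sorted(w for w in single_edits(typo) if w in dictionary)


# Toy dictionary (assumption); "acre" and "address" are more than one edit away.
TOY_DICTIONARY = {"actress", "cress", "caress", "access", "across",
                  "acres", "acre", "address"}

print(candidates("acress", TOY_DICTIONARY))
```

With this toy dictionary the sketch proposes the six distinct candidate words of Figure 5.2; the figure lists acres twice because two different deletions of s lead to the same word.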

The second stage of the algorithm scores each correction by
Equation 5.4. Let t represent the typo (the misspelled word), and let c range over
the set C of candidate corrections. The most likely correction is then:

$$\hat{c} = \operatorname*{argmax}_{c \in C} \overbrace{P(t \mid c)}^{\text{likelihood}}\;\overbrace{P(c)}^{\text{prior}} \qquad (5.6)$$

As in Equation 5.4 we have omitted the denominator in Equation 5.6
since the typo t, and hence its probability P(t), is constant for all c. The
prior probability of each correction P(c) can be estimated by counting how
often the word c occurs in some corpus, and then normalizing these counts
by the total count of all words.2 So the probability of a particular correction
word c is computed by dividing the count of c by the number N of words
in the corpus. Zero counts can cause problems, and so we will add .5 to all
the counts. This is called ‘smoothing’, and will be discussed in Chapter 6;
note that in Equation 5.7 we can’t just divide by the total number of words
N since we added .5 to the counts of all the words, so we add .5 for each of
the V words in the vocabulary:

$$P(c) = \frac{C(c) + 0.5}{N + 0.5V} \qquad (5.7)$$

2 Normalizing means dividing by some total count so that the resulting probabilities fall
legally between 0 and 1.
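
Equation (5.7) can be checked on a toy example. The Python sketch below is illustrative only: the counts are invented, the name smoothed_prior is hypothetical, and it assumes that every corpus token belongs to the vocabulary, so that N is the sum of the counts.

```python
def smoothed_prior(counts, vocabulary, k=0.5):
    """Add-k smoothed unigram prior: P(c) = (C(c) + k) / (N + k * V)."""
    n = sum(counts.get(w, 0) for w in vocabulary)   # N: corpus size
    denom = n + k * len(vocabulary)                 # N + 0.5 V when k = 0.5
    return {w: (counts.get(w, 0) + k) / denom for w in vocabulary}


# Invented counts (assumption); "cress" never occurs in this toy corpus.
toy_counts = {"across": 10, "acres": 6, "actress": 3}
toy_vocabulary = ["across", "acres", "actress", "cress"]

prior = smoothed_prior(toy_counts, toy_vocabulary)
for word in toy_vocabulary:
    print(word, round(prior[word], 4))
```

The unseen word receives a small non-zero probability, and because the denominator grows by 0.5 per vocabulary word the smoothed values still sum to 1.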


Chapter 6 will talk more about the role of corpora in computing prior
probabilities; for now let’s use the corpus of Kernighan et al. (1990), which
is the 1988 AP newswire corpus of 44 million words. Thus N is 44 million.
Since in this corpus, the word actress occurs 1343 times, the word acres
2879 times, and so on, the resulting prior probabilities are as follows:


c        freq(c)  p(c)
actress  1343     .0000315
cress    0        .000000014
caress   4        .0000001
access   2280     .000058
across   8436     .00019
acres    2879     .000065

Computing the likelihood term p(t|c) exactly is an unsolved
(unsolvable?) research problem; the exact probability that a word will be mistyped
depends on who the typist was, how familiar they were with the keyboard
they were using, whether one hand happened to be more tired than the other,
etc. Luckily, while p(t|c) cannot be computed exactly, it can be estimated
pretty well, because the most important factors predicting an insertion,
deletion, or transposition are simple local factors like the identity of the correct
letter itself, how the letter was misspelled, and the surrounding context. For
example, the letters m and n are often substituted for each other; this is partly
a fact about their identity (these two letters are pronounced similarly and
they are next to each other on the keyboard), and partly a fact about context
(because they are pronounced similarly, they occur in similar contexts).

One simple way to estimate these probabilities is the one that Kernighan
et al. (1990) used. They ignored most of the possible influences on the
probability of an error and just estimated e.g. p(acress|across) using the number
of times that e was substituted for o in some large corpus of errors. This is
represented by a confusion matrix, a square 26x26 table which represents
the number of times one letter was incorrectly used instead of another. For
example, the cell labeled [o,e] in a substitution confusion matrix would give
the count of times that e was substituted for o. The cell labeled [s,t] in an
insertion confusion matrix would give the count of times that t was inserted
after s. A confusion matrix can be computed by hand-coding a collection
of spelling errors with the correct spelling and then counting the number
of times different errors occurred (this has been done by Grudin (1983)).
Kernighan et al. (1990) used four confusion matrices, one for each type of
single error:

• del[x,y] contains the number of times in the training set that the
characters xy in the correct word were typed as x.

• ins[x,y] contains the number of times in the training set that the
character x in the correct word was typed as xy.

• sub[x,y] the number of times that x was typed as y.

• trans[x,y] the number of times that xy was typed as yx.
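The four definitions above can be sketched in code. The following is a minimal illustration, not Kernighan et al.'s implementation: it assumes each training item is a (typo, correct) pair that differs by exactly one edit, and it uses the hypothetical symbol "#" to stand for the start of a word when there is no previous character. The function names are invented for this example.

```python
from collections import Counter

def classify_single_edit(typo, correct):
    """Return (kind, x, y) indexing the matrix cell kind[x,y], or None.

    Follows the definitions above: del[x,y] means xy was typed as x,
    ins[x,y] means x was typed as xy, sub[x,y] means x was typed as y,
    and trans[x,y] means xy was typed as yx.
    """
    # Length of the longest common prefix of the two strings.
    p = 0
    while p < min(len(typo), len(correct)) and typo[p] == correct[p]:
        p += 1
    # Previous correct character; "#" is an assumed word-start symbol.
    prev = correct[p - 1] if p > 0 else "#"
    if len(typo) == len(correct) + 1 and typo[p + 1:] == correct[p:]:
        return ("ins", prev, typo[p])
    if len(typo) + 1 == len(correct) and typo[p:] == correct[p + 1:]:
        return ("del", prev, correct[p])
    if len(typo) == len(correct) and p < len(correct):
        if typo[p + 1:] == correct[p + 1:]:
            return ("sub", correct[p], typo[p])
        if (p + 1 < len(correct)
                and typo[p] == correct[p + 1]
                and typo[p + 1] == correct[p]
                and typo[p + 2:] == correct[p + 2:]):
            return ("trans", correct[p], correct[p + 1])
    return None  # identical strings, or more than one edit apart

def count_confusions(pairs):
    """Build the four confusion matrices from (typo, correct) pairs."""
    matrices = {kind: Counter() for kind in ("del", "ins", "sub", "trans")}
    for typo, correct in pairs:
        edit = classify_single_edit(typo, correct)
        if edit is not None:
            kind, x, y = edit
            matrices[kind][(x, y)] += 1
    return matrices
```

For the running example, `classify_single_edit("acress", "actress")` yields the cell del[c,t], and `classify_single_edit("acress", "across")` yields sub[o,e]. When an inserted letter could sit in more than one place (as with acres and acress), this sketch reports only one of the possible cells.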


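Note that they chose to condition their insertion and deletion
probabilities on the previous character; they could also have chosen to
condition on the following character. Using these matrices, they estimated
p(t|c) as follows (where c_p is the pth character of the word c, and t_p is
the pth character of the typo t):

\[
P(t \mid c) =
\begin{cases}
\dfrac{\mathrm{del}[c_{p-1}, c_p]}{\mathrm{count}[c_{p-1} c_p]}, & \text{if deletion} \\[2ex]
\dfrac{\mathrm{ins}[c_{p-1}, t_p]}{\mathrm{count}[c_{p-1}]}, & \text{if insertion} \\[2ex]
\dfrac{\mathrm{sub}[c_p, t_p]}{\mathrm{count}[c_p]}, & \text{if substitution} \\[2ex]
\dfrac{\mathrm{trans}[c_p, c_{p+1}]}{\mathrm{count}[c_p c_{p+1}]}, & \text{if transposition}
\end{cases}
\tag{5.8}
\]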
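Equation 5.8 amounts to dividing one confusion-matrix cell by the number of times its context occurred in correctly spelled text. The sketch below assumes the matrices are dictionaries keyed by (x, y) as in the definitions of del, ins, sub, and trans, and that `char_counts` holds single-character and two-character counts from some corpus; both data structures, the function name, and the numbers in the usage note are hypothetical placeholders, not values from Kernighan et al. (1990).

```python
def channel_probability(kind, x, y, matrices, char_counts):
    """Estimate p(t|c) for a single error, in the spirit of Equation 5.8.

    kind is one of "del", "ins", "sub", "trans"; (x, y) indexes the
    corresponding confusion matrix cell.
    """
    errors = matrices.get(kind, {}).get((x, y), 0)
    if kind in ("del", "trans"):
        # The context is the two-character sequence xy in the correct word.
        context = char_counts.get(x + y, 0)
    elif kind in ("ins", "sub"):
        # The context is the single correct character x.
        context = char_counts.get(x, 0)
    else:
        raise ValueError("unknown error type: " + kind)
    # With no evidence about the context, fall back to 0 (no smoothing).
    return errors / context if context else 0.0
```

With made-up counts such as `matrices = {"sub": {("o", "e"): 3}}` and `char_counts = {"o": 1000}`, the call `channel_probability("sub", "o", "e", matrices, char_counts)` gives 3/1000. A real system would probably smooth these estimates rather than return zero for unseen errors.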


Figure  5.3  shows  the  final  probabilities  for  each  of  the  potential  correc- 
tions; the  prior  (from  Equation  5.7)  is  multiplied  by  the  likelihood  (computed 
using  Equation  5.8  and  the  confusion  matrices).  The  final  column  shows  the 
‘normalized  percentage’. 


c         freq(c)   p(c)         p(t|c)       p(t|c)p(c)       %
actress   1343      .0000315     .000117      3.69 x 10^-9     37%
cress     0         .000000014   .00000144    2.02 x 10^-14    0%
caress    4         .0000001     .00000164    1.64 x 10^-13    0%
access    2280      .000058      .000000209   1.21 x 10^-11    0%
across    8436      .00019       .0000093     1.77 x 10^-9     18%
acres     2879      .000065      .0000321     2.09 x 10^-9     21%
acres     2879      .000065      .0000342     2.22 x 10^-9     23%

Figure 5.3 Computation of the ranking for each candidate correction. Note
that the highest ranked word is not actress but acres (the two lines at the
bottom of the table), since acres can be generated in two ways. The del[],
ins[], sub[], and trans[] confusion matrices are given in full in Kernighan
et al. (1990).

This implementation of the Bayesian algorithm predicts acres as the
correct word (at a total normalized percentage of 45%), and actress as the
second most likely word. Unfortunately, the algorithm was wrong here: the
writer’s intention becomes clear from the context: . . . was called a “stellar
and versatile acress whose combination of sass and glamour has defined
her. . . ”. The surrounding words make it clear that actress and not acres was
the intended word; Chapter 6 will show how to augment the computation of
the prior probability to use the surrounding words.
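The ranking in Figure 5.3 can be reproduced from the table's own p(c) and p(t|c) columns. The sketch below multiplies prior by likelihood for each row, adds together the two ways of generating acres, and normalizes. The function name is invented for this example, and because the inputs are the rounded values printed in the figure, the shares it computes can differ by about a percentage point from the percentages quoted in the text.

```python
# (candidate, prior p(c), likelihood p(t|c)), copied from Figure 5.3.
CANDIDATES = [
    ("actress", 0.0000315, 0.000117),
    ("cress", 0.000000014, 0.00000144),
    ("caress", 0.0000001, 0.00000164),
    ("access", 0.000058, 0.000000209),
    ("across", 0.00019, 0.0000093),
    ("acres", 0.000065, 0.0000321),
    ("acres", 0.000065, 0.0000342),
]

def rank_corrections(rows):
    """Return (word, normalized share) pairs, best candidate first."""
    scores = {}
    for word, prior, likelihood in rows:
        # A word that can be generated in several ways accumulates
        # the probability of each way.
        scores[word] = scores.get(word, 0.0) + prior * likelihood
    total = sum(scores.values())
    shares = {word: score / total for word, score in scores.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)

ranking = rank_corrections(CANDIDATES)
```

Here `ranking[0]` is acres and `ranking[1]` is actress, matching the order discussed above.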

The  algorithm  as  we  have  described  it  requires  hand-annotated  data  to 
train  the  confusion  matrices.  An  alternative  approach  used  by  Kernighan 
et  al.  (1990)  is  to  compute  the  matrices  by  iteratively  using  this  very  spelling 
error  correction  algorithm  itself.  The  iterative  algorithm  first  initializes  the 
matrices  with  equal  values;  thus  any  character  is  equally  likely  to  be  deleted, 
equally  likely  to  be  substituted  for  any  other  character,  etc.  Next  the  spelling 
error  correction  algorithm  is  run  on  a set  of  spelling  errors.  Given  the  set 
of  typos  paired  with  their  corrections,  the  confusion  matrices  can  now  be 
recomputed,  the  spelling  algorithm  run  again,  and  so  on.  This  clever  method 
turns out to be an instance of the important EM algorithm (Dempster et al.,
1977) that we will discuss in Chapter 7 and Appendix D. Kernighan et al.
(1990)’s  algorithm  was  evaluated  by  taking  some  spelling  errors  that  had 
two  potential  corrections,  and  asking  three  human  judges  to  pick  the  best 
correction.  Their  program  agreed  with  the  majority  vote  of  the  human  judges 
87%  of  the  time. 


5.6  Minimum  Edit  Distance 

The  previous  section  showed  that  the  Bayesian  algorithm,  as  implemented 
with  confusion  matrices,  was  able  to  rank  candidate  corrections.  But  Kernighan 
et  al.  (1990)  relied  on  the  simplifying  assumption  that  each  word  had  only  a 
single spelling error. Suppose we wanted a more powerful algorithm which
could handle the case of multiple errors? We could think of such an
algorithm as a general solution to the problem of string distance. The
‘string distance’ is some metric of how alike two strings are to each other. The
Bayesian  method  can  be  viewed  as  a way  of  applying  such  an  algorithm  to 
the  spelling  error  correction  problem;  we  pick  the  candidate  word  which  is 
‘closest’  to  the  error  in  the  sense  of  having  the  highest  probability  given  the 
error. 

One of the most popular classes of algorithms for finding string
distance consists of those that use some version of the minimum edit
distance algorithm, named by Wagner and Fischer (1974) but independently
discovered by many people; see the History section. The minimum edit
distance between two strings is the minimum number of editing operations
(insertion, deletion, substitution) needed to transform one string into
another. For example the gap between intention and execution is 5
operations, which can be represented in three ways: as a trace, an
alignment, or an operation list, as shown in Figure 5.4.

Figure 5.4 Three methods for representing differences between sequences
(after Kruskal (1983)).

We can also assign a particular cost or weight to each of these
operations. The Levenshtein distance between two sequences is the simplest
weighting factor in which each of the three operations has a cost of 1
(Levenshtein, 1966). Thus the Levenshtein distance between intention and
execution is 5. Levenshtein also proposed an alternate version of his metric
in which each insertion or deletion has a cost of one, and substitutions are
not allowed (equivalent to allowing substitution, but giving each
substitution a cost of 2, since any substitution can be represented by 1
insertion and 1 deletion). Using this version, the Levenshtein distance
between intention and execution is 8. We can also weight operations by more
complex functions, for example by using the confusion matrices discussed
above to assign a probability to each operation. In this case instead of
talking about the ‘minimum edit distance’ between two strings, we are talking
about the ‘maximum probability alignment’ of one string with another. If we
do this, an augmented minimum edit distance algorithm which multiplies the
probabilities of each transformation can be used to estimate the Bayesian
likelihood of a multiple-error typo given a candidate correction.
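One way to see why a minimum-cost algorithm can also find a maximum probability alignment is sketched below. It assumes, as a simplification not stated in the text, that an alignment A consists of operations o_1, ..., o_K whose probabilities p(o_k) are estimated independently (for instance from the confusion matrices). Because the logarithm is monotonic, maximizing a product of probabilities is the same as minimizing a sum of negative log probabilities, so each operation can be given the cost -log p(o_k):

```latex
\[
\hat{A}
  = \operatorname*{argmax}_{A} \prod_{k=1}^{K} p(o_k)
  = \operatorname*{argmax}_{A} \sum_{k=1}^{K} \log p(o_k)
  = \operatorname*{argmin}_{A} \sum_{k=1}^{K} \bigl(-\log p(o_k)\bigr)
\]
```

Working with sums of log probabilities also avoids the numerical underflow that multiplying many small probabilities can cause.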

The minimum edit distance is computed by dynamic programming.
Dynamic programming is the name for a class of algorithms, first introduced
by Bellman (1957), that apply a table-driven method to solve problems by
combining solutions to subproblems. This class of algorithms includes the
most commonly-used algorithms in speech and language processing, among
them the minimum edit distance algorithm for spelling error correction, the
Viterbi algorithm and the forward algorithm, which are used both in speech
recognition and in machine translation, and the CYK and Earley algorithms
used in parsing. We will introduce the minimum-edit-distance, Viterbi, and
forward algorithms in this chapter and Chapter 7, the Earley algorithm in
Chapter 10, and the CYK algorithm in Chapter 12.

The  intuition  of  a dynamic  programming  problem  is  that  a large  prob- 
lem can  be  solved  by  properly  combining  the  solutions  to  various  subprob- 
lems. For  example,  consider  the  sequence  or  ‘path’  of  transformed  words 
that  comprise  the  minimum  edit  distance  between  the  strings  intention  and 
execution.  Imagine  some  string  (perhaps  it  is  exention ) that  is  in  this  opti- 
mal path  (whatever  it  is).  The  intuition  of  dynamic  programming  is  that  if 
exention  is  in  the  optimal  operation-list,  then  the  optimal  sequence  must  also 
include  the  optimal  path  from  intention  to  exention.  Why?  If  there  were  a 
shorter  path  from  intention  to  exention  then  we  could  use  it  instead,  resulting 
in  a shorter  overall  path,  and  the  optimal  sequence  wouldn’t  be  optimal,  thus 
leading  to  a contradiction. 

Dynamic programming algorithms for sequence comparison work by
creating a distance matrix with one column for each symbol in the target
sequence and one row for each symbol in the source sequence (i.e. target
along the bottom, source along the side). For minimum edit distance, this
matrix is the edit-distance matrix. Each cell edit-distance[i,j] contains the
distance between the first i characters of the target and the first j
characters of the source. Each cell can be computed as a simple function of
the surrounding cells; thus starting from the beginning of the matrix it is
possible to fill in every entry. The value in each cell is computed by taking
the minimum of the three possible paths through the matrix which arrive
there:

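\[
\mathit{distance}[i,j] = \min
\begin{cases}
\mathit{distance}[i-1,j] + \text{ins-cost}(\mathit{target}_i) \\
\mathit{distance}[i-1,j-1] + \text{subst-cost}(\mathit{source}_j, \mathit{target}_i) \\
\mathit{distance}[i,j-1] + \text{ins-cost}(\mathit{source}_j)
\end{cases}
\tag{5.9}
\]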

The  algorithm  itself  is  summarized  in  Figure  5.5,  while  Figure  5.6 
shows  the  results  of  applying  the  algorithm  to  the  distance  between  inten- 
tion and  execution  assuming  the  version  of  Levenshtein  distance  in  which 
insertions  and  deletions  each  have  a cost  of  1 and  substitutions  have  a cost 
of  2. 


function MIN-EDIT-DISTANCE(target, source) returns min-distance

  n ← LENGTH(target)
  m ← LENGTH(source)
  Create a distance matrix distance[n+1,m+1]
  distance[0,0] ← 0
  for each column i from 0 to n do
    for each row j from 0 to m do
      distance[i,j] ← MIN(distance[i−1,j] + ins-cost(target_i),
                          distance[i−1,j−1] + subst-cost(source_j, target_i),
                          distance[i,j−1] + ins-cost(source_j))


Figure  5.5  The  minimum  edit  distance  algorithm,  an  example  of  the  class 
of  dynamic  programming  algorithms. 
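A runnable sketch of the algorithm in Figure 5.5 follows. It is one possible rendering, not the book's code: it fills the first row and column explicitly (a boundary case the pseudocode leaves implicit), treats substituting a character for itself as free, and uses the name `del_cost` for the third term, which Figure 5.5 also writes as ins-cost. The default costs are the version of Levenshtein distance with insertions and deletions at 1 and substitutions at 2.

```python
def min_edit_distance(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Minimum cost of editing source into target by dynamic programming."""
    n, m = len(target), len(source)
    # distance[i][j] is the cost of turning the first j characters of
    # source into the first i characters of target.
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    # Boundary cases: build a target prefix from nothing (insertions),
    # or reduce a source prefix to nothing (deletions).
    for i in range(1, n + 1):
        distance[i][0] = distance[i - 1][0] + ins_cost
    for j in range(1, m + 1):
        distance[0][j] = distance[0][j - 1] + del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = target[i - 1] == source[j - 1]
            distance[i][j] = min(
                distance[i - 1][j] + ins_cost,                      # insert target[i-1]
                distance[i - 1][j - 1] + (0 if same else sub_cost),  # substitute or copy
                distance[i][j - 1] + del_cost,                      # delete source[j-1]
            )
    return distance[n][m]
```

Calling `min_edit_distance("execution", "intention")` gives the distance of 8 quoted above for this cost scheme, and passing `sub_cost=1` gives the plain Levenshtein distance of 5.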


5.7  English  Pronunciation  Variation 

. . . when  any  of  the  fugitives  of  Ephraim  said:  ‘Let  me  go  over,’  the 
men  of  Gilead  said  unto  him:  ‘Art  thou  an  Ephraimite?’  If  he  said: 
‘Nay’;  then  said  they  unto  him:  ‘Say  now  Shibboleth’;  and  he  said 
‘Sibboleth’;  for  he  could  not  frame  to  pronounce  it  right;  then  they  laid 
hold  on  him,  and  slew  him  at  the  fords  of  the  Jordan; 

Judges  12:5-6 

Figure 5.6 Computation of minimum edit distance between intention and
execution via algorithm of Figure 5.5, using Levenshtein distance with cost
of 1 for insertions or deletions, 2 for substitutions. Substitution of a
character for itself has a cost of 0.

This passage from Judges is a rather gory reminder of the political
importance of pronunciation variation. Even in our (hopefully less
political) computational applications of pronunciation, it is important to
correctly model how pronunciations can vary. We have already seen that a
phoneme can be realized as different allophones in different phonetic
environments. We have also shown how to write rules and transducers to model
these changes for speech synthesis. Unfortunately, these models
significantly simplified the nature of pronunciation variation. In
particular, pronunciation variation is caused by many factors in addition to
the phonetic environment. This section summarizes some of these kinds of
variation; the following section will introduce the probabilistic tools for
modeling it.

Pronunciation variation is extremely widespread. Figure 5.7 shows
the most common pronunciations of the words because and about from the
hand-transcribed Switchboard corpus of American English telephone
conversations. Note the wide variation in pronunciation for these two words
when spoken as part of a continuous stream of speech.

What causes this variation? There are two broad classes of
pronunciation variation: lexical variation and allophonic variation. We can
think of lexical variation as a difference in what segments are used to
represent the word in the lexicon, while allophonic variation is a
difference in how the individual segments change their value in different
contexts. In Figure 5.7, most of the variation in pronunciation is
allophonic; i.e. due to the influence of the surrounding sounds, syllable
structure, etc. But the fact that the word because can be pronounced either
as monosyllabic ’cause or bisyllabic because is probably a lexical fact,
having to do perhaps with the level of informality of speech.

because                               about
IPA       ARPAbet           %         IPA        ARPAbet         %
[bikʌz]   [b iy k ah z]     27%       [əbaʊ]     [ax b aw]       32%
[bɨkʌz]   [b ix k ah z]     14%       [əbaʊt]    [ax b aw t]     16%
[kʌz]     [k ah z]          7%        [baʊ]      [b aw]          9%
[kəz]     [k ax z]          5%        [ɨbaʊ]     [ix b aw]       8%
[bɨkəz]   [b ix k ax z]     4%        [ɨbaʊt]    [ix b aw t]     5%
[bɪkʌz]   [b ih k ah z]     3%        [ɨbæ]      [ix b ae]       4%
[bəkʌz]   [b ax k ah z]     3%        [əbæɾ]     [ax b ae dx]    3%
[kʊz]     [k uh z]          2%        [baʊɾ]     [b aw dx]       3%
[ks]      [k s]             2%        [bæ]       [b ae]          3%
[kɨz]     [k ix z]          2%        [baʊt]     [b aw t]        3%
[kɪz]     [k ih z]          2%        [əbaʊɾ]    [ax b aw dx]    3%
[bikʌʒ]   [b iy k ah zh]    2%        [əbæ]      [ax b ae]       3%
[bikʌs]   [b iy k ah s]     2%        [bɑ]       [b aa]          3%
[bikʌ]    [b iy k ah]       2%        [bæɾ]      [b ae dx]       3%
[bikɑz]   [b iy k aa z]     2%        [ɨbaʊɾ]    [ix b aw dx]    2%
[əz]      [ax z]            2%        [ɨbɑt]     [ix b aa t]     2%

Figure 5.7 The 16 most common pronunciations of because and about
from the hand-transcribed Switchboard corpus of American English
conversational telephone speech (Godfrey et al., 1992; Greenberg et al., 1996).

An important source of lexical variation (although it can also affect
allophonic variation) is sociolinguistic variation. Sociolinguistic
variation is due to extralinguistic factors such as the social identity or
background of the speaker. One kind of sociolinguistic variation is dialect
variation. Speakers of some deep-southern dialects of American English use a
monophthong or near-monophthong [a] or [aɛ] instead of a diphthong in some
words with the vowel [aɪ]. In these dialects rice is pronounced [ra:s].
African-American Vernacular English (AAVE) has many of the same vowel
differences from General American as does Southern American English, and
also has individual words with specific pronunciations such as [bɪdnɪs] for
business and [æks] for ask. For older speakers or those not from the
American West or Midwest, the words caught and cot have different vowels
([kɔt] and [kɑt] respectively). Young American speakers or those from the
West pronounce the two words cot and caught the same; the vowels [ɔ] and [ɑ]
are usually not distinguished in these dialects. For some speakers from New
York City like the first author’s parents, the words Mary ([meɪri]), marry
([mæri]), and merry ([mɛri]) are all pronounced differently, while other New
York City speakers like the second author pronounce Mary and merry
identically, but differently than marry. Most American speakers pronounce
all three of these words identically as [mɛri]. Students who are interested
in dialects of English should consult Wells (1982), the most comprehensive
study of dialects of English around the world.

Other sociolinguistic differences are due to register or style rather than
dialect. In a pronunciation difference that is due to style, the same speaker
might pronounce the same word differently depending on who they were
talking to or what the social situation is; this is probably the case when
choosing between because and ’cause above. One of the most well-studied
examples of style-variation is the suffix -ing (as in something), which can
be pronounced [ɪŋ] or [ɪn] (this is often written somethin’). Most speakers
use both forms; as Labov (1966) shows, they use [ɪŋ] when they are being more
formal, and [ɪn] when more casual. In fact whether a speaker will use [ɪŋ]
or [ɪn] in a given situation varies markedly according to the social context,
the gender of the speaker, the gender of the other speaker, etc. Wald and
Shopen (1981) found that men are more likely to use the non-standard form
[ɪn] than women, that both men and women are more likely to use more of
the standard form [ɪŋ] when the addressee is a woman, and that men (but not
women) tend to switch to [ɪn] when they are talking with friends.

Where lexical variation happens at the lexical level, allophonic
variation happens at the surface form and reflects phonetic and articulatory
factors.³ For example, most of the variation in the word about in Figure 5.7
was caused by changes in one of the two vowels or by changes to the final
[t]. Some of this variation is due to the allophonic rules we have already
discussed for the realization of the phoneme /t/. For example the
pronunciation of about as [əbaʊɾ] ([ax b aw dx]) has a flap at the end
because the next word was the word it, which begins with a vowel; the
sequence about it was pronounced [əbaʊɾɨ] ([ax b aw dx ix]). Similarly note
that final [t] is often deleted (about as [baʊ]/[b aw]). Considering these
cases as ‘deleted’ is actually a simplification; many of these ‘deleted’
cases of [t] are actually realized as a slight change to the vowel quality
called glottalization, which is not represented in these transcriptions.


3 Many  linguists  distinguish  between  allophonic  variation  and  what  are  called  ‘optional 
phonological  rules’;  for  the  purposes  of  this  textbook  we  will  lump  these  both  together  as 
‘allophonic  variation’. 



When  we  discussed  these  rules  earlier,  we  implied  that  they  were  de- 
terministic; given  an  environment,  a rule  always  applies.  This  is  by  no  means 
the  case.  Each  of  these  allophonic  rules  is  dependent  on  a complicated  set  of 
factors  that  must  be  interpreted  probabilistically.  In  the  rest  of  this  section 
we  summarize  more  of  these  rules  and  talk  about  the  influencing  factors. 
Many  of  these  rules  model  coarticulation,  which  is  a change  in  a segment 
due to the movement of the articulators in neighboring segments. Most
allophonic rules relating English phonemes to their allophones can be grouped
into a small number of types: assimilation, dissimilation, deletion, flapping,
vowel reduction, and epenthesis.

Assimilation is the change in a segment to make it more like a
neighboring segment. The dentalization of [t] to [t̪] before the dental
consonant [θ] is an example of assimilation. Another common type of
assimilation in English and cross-linguistically is palatalization.
Palatalization occurs when the constriction for a segment occurs closer to
the palate than it normally would, because the following segment is palatal
or alveolo-palatal. In the most common cases, /s/ becomes [ʃ], /z/ becomes
[ʒ], /t/ becomes [tʃ], and /d/ becomes [dʒ]. We saw one case of
palatalization in Figure 5.7 in the pronunciation of because as [bikʌʒ]
(ARPAbet [b iy k ah zh]). Here the final segment of because, a lexical /z/,
is realized as [ʒ], because the following word was you’ve. So the sequence
because you’ve was pronounced [bikʌʒuv]. A simple version of a palatalization
rule might be expressed as follows; Figure 5.8 shows examples from the
Switchboard corpus.


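\[
\begin{Bmatrix} \text{[s]} \\ \text{[z]} \\ \text{[t]} \\ \text{[d]} \end{Bmatrix}
\Rightarrow
\begin{Bmatrix} \text{[ʃ]} \\ \text{[ʒ]} \\ \text{[tʃ]} \\ \text{[dʒ]} \end{Bmatrix}
\Big/ \; \underline{\quad} \; \{ \text{y} \}
\tag{5.10}
\]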


Note  in  Figure  5.8  that  whether  a [t]  is  palatalized  depends  on  lexical 
factors  like  word  frequency  ([t]  is  more  likely  to  be  palatalized  in  frequent 
words  and  phrases). 
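The palatalization rule can be sketched as a rewrite over ARPAbet phone sequences. This is a deliberately simplified, deterministic illustration: the function name and the choice to represent a phrase as a flat list of phone strings are assumptions made for the example, and the rule here fires every time, whereas the surrounding discussion makes clear that a realistic model would have to apply it probabilistically.

```python
# ARPAbet mapping for the four cases named in the text:
# /s/ -> [sh], /z/ -> [zh], /t/ -> [ch], /d/ -> [jh].
PALATALIZED = {"s": "sh", "z": "zh", "t": "ch", "d": "jh"}

def palatalize(phones):
    """Palatalize s, z, t, d when the next phone is the glide y."""
    out = list(phones)
    for k in range(len(out) - 1):
        if out[k] in PALATALIZED and out[k + 1] == "y":
            out[k] = PALATALIZED[out[k]]
    return out
```

For example, `palatalize(["d", "ih", "d", "y", "ah"])` returns `["d", "ih", "jh", "y", "ah"]`, the reduced form listed for did you in Figure 5.8. Several other Switchboard examples also lose the glide itself (this year as [dh ih sh iy r]); this sketch does not model that further step.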

Deletion is quite common in English speech. We saw examples of
deletion of final /t/ above, in the words about and it. /t/ and /d/ are often
deleted before consonants, or when they are part of a sequence of two or
three consonants; Figure 5.9 shows some examples.


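\[
\begin{Bmatrix} \text{t} \\ \text{d} \end{Bmatrix}
\rightarrow \emptyset \; / \; V \, \underline{\quad} \, C
\tag{5.11}
\]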


The many factors that influence the deletion of /t/ and /d/ have been
extensively studied. For example /d/ is more likely to be deleted than /t/.
Both are more likely to be deleted before a consonant (Labov, 1972). The
final /t/ and /d/ in the words and and just are particularly likely to be
deleted (Labov, 1975; Neu, 1980). Wolfram (1969) found that deletion is more
likely in faster or more casual speech, and that younger people and males
are more likely to delete. Deletion is more likely when the two words
surrounding the segment act as a sort of phrasal unit, either occurring
together frequently (Bybee, 1996), having a high mutual information or
trigram predictability (Gregory et al., 1999), or being tightly connected for
other reasons (Zwicky, 1972). Fasold (1972), Labov (1972), and many others
have shown that deletion is less likely if the word-final /t/ or /d/ is the
past tense ending. For example in Switchboard, deletion is more likely in the
word around (73% /d/-deletion) than in the word turned (30% /d/-deletion)
even though the two words have similar frequencies.

Phrase           IPA Lexical    IPA Reduced    ARPAbet Reduced
set your         [sɛtjɔr]       [sɛtʃɚ]        [s eh ch er]
not yet          [nɑtjɛt]       [nɑtʃɛt]       [n aa ch eh t]
last year        [læstjir]      [læstʃir]      [l ae s ch iy r]
what you         [wʌtju]        [wətʃu]        [w ax ch uw]
this year        [ðɪsjir]       [ðɪʃir]        [dh ih sh iy r]
because you’ve   [bikʌzjuv]     [bikʌʒuv]      [b iy k ah zh uw v]
did you          [dɪdju]        [dɪdʒyʌ]       [d ih jh y ah]

Figure 5.8 Examples of palatalization from the Switchboard corpus; the
lemma you (including your, you’ve, and you’d) was by far the most common
cause of palatalization, followed by year(s) (especially in the phrases this
year and last year).

Phrase          IPA Lexical      IPA Reduced      ARPAbet Reduced
find him        [faɪndhɪm]       [faɪnɨm]         [f ay n ix m]
around this     [əraʊndðɪs]      [ɨraʊnɪs]        [ix r aw n ih s]
mind boggling   [maɪndbɔglɪŋ]    [maɪnbɔglɪŋ]     [m ay n b ao g el ih ng]
most places     [moʊstpleɪsɨz]   [moʊspleɪsɨz]    [m ow s p l ey s ix z]
draft the       [dræftði]        [dræfði]         [d r ae f dh iy]
left me         [lɛftmi]         [lɛfmi]          [l eh f m iy]

Figure 5.9 Examples of /t/ and /d/ deletion from Switchboard. Some of
these examples may have glottalization instead of being completely deleted.


The flapping rule is significantly more complicated than we suggested
in Chapter 4, as a number of scholars have pointed out (see especially Rhodes
(1992)). The preceding vowel is highly likely to be stressed, although this is
not necessary (for example there is commonly a flap in the word thermometer
[θɚmɑmɨɾɚ]). The following vowel is highly likely to be unstressed, although
again this is not necessary. /t/ is much more likely to flap than /d/. There
are complicated interactions with syllable, foot, and word boundaries.
Flapping is more likely to happen when the speaker is speaking more
quickly, and is more likely to happen at the end of a word when it forms
a collocation (high mutual information) with the following word (Gregory
et al., 1999). Flapping is less likely to happen when a speaker
hyperarticulates, i.e. uses a particularly clear form of speech, which often
happens when users are talking to computer speech recognition systems
(Oviatt et al., 1998). There is a nasal flap [ɾ̃] whose tongue movements
resemble the oral flap but in which the velum is lowered. Finally, flapping
doesn’t always happen, even when the environment is appropriate; thus the
flapping rule, or transducer, needs to be probabilistic, as we will see below.

We have saved for last one of the most important phonological
processes: vowel reduction, in which many vowels in unstressed syllables are
realized as reduced vowels, the most common of which is schwa ([ə]).
Stressed syllables are those in which more air is pushed out of the lungs;
stressed syllables are longer, louder, and usually higher in pitch than
unstressed syllables. Vowels in unstressed syllables in English often don’t
have their full form; the articulatory gesture isn’t as complete as for a
full vowel. As a result the shape of the mouth is somewhat neutral; the
tongue is neither particularly high nor particularly low. For example the
second vowel in parakeet is schwa: [pærəkit].

While schwa is the most common reduced vowel, it is not the only
one, at least not in some dialects. Bolinger (1981) proposed three reduced
vowels: a reduced mid vowel [ə], a reduced front vowel [ɨ], and a reduced
rounded vowel [ɵ]. But the majority of computational pronunciation lexicons
or computational models of phonology systems limit themselves to one
reduced vowel ([ə]) (for example PRONLEX and CELEX) or at most two
([ə] = ARPAbet [ax] and [ɨ] = ARPAbet [ix]). Miller (1998) was able to
train a neural net to automatically categorize a vowel as [ə] or [ɨ] based only
on the phonetic context, which suggests that for speech recognition and
text-to-speech purposes, one reduced vowel is probably adequate. Indeed Wells
(1982, 167-168) notes that [ə] and [ɨ] are falling together in many dialects of
English including General American and Irish, among others, a phenomenon
he calls weak vowel merger.

A final note: not all unstressed vowels are reduced; any vowel, and
diphthongs in particular, can retain their full quality even in unstressed
position. For example the vowel [eɪ] (ARPAbet [ey]) can appear in stressed
position as in the word eight ([eɪt]) or unstressed position as in the word
always ([ɔ.weɪz]). Whether a vowel is reduced depends on many factors. For
example the word the can be pronounced with a full vowel [ði] or reduced
vowel [ðə]. It is more likely to be pronounced with the reduced vowel [ðə] in
fast speech, in more casual situations, and when the following word begins
with a consonant. It is more likely to be pronounced with the full vowel [ði]
when the following word begins with a vowel or when the speaker is having
‘planning problems’; speakers are more likely to use a full vowel than a
reduced one if they don’t know what they are going to say next (Fox Tree and
Clark, 1997). See Keating et al. (1994) and Jurafsky et al. (1998) for more
details on factors affecting vowel reduction in the TIMIT and Switchboard
corpora. Other factors influencing reduction include the frequency of the
word, whether this is the final vowel in a phrase, and even the
idiosyncrasies of individual speakers.

HEAD KNIGHT OF NI:  Ni!
KNIGHTS OF NI:      Ni! Ni! Ni! Ni! Ni!
ARTHUR:             Who are you?
HEAD KNIGHT:        We are the Knights Who Say... ‘Ni’!
RANDOM:             Ni!
ARTHUR:             No! Not the Knights Who Say ‘Ni’!
HEAD KNIGHT:        The same!
BEDEVERE:           Who are they?
HEAD KNIGHT:        We are the keepers of the sacred words: ‘Ni’, ‘Peng’,
                    and ‘Neee-wom’!

Graham Chapman, John Cleese, Eric Idle, Terry Gilliam, Terry Jones,
and Michael Palin, Monty Python and the Holy Grail, 1975.


The Bayesian algorithm that we used to pick the optimal correction for
a spelling error can be used to solve what is often called the pronunciation
subproblem in speech recognition. In this task, we are given a series of
phones and our job is to compute the most probable word which generated
them. For this chapter, we will simplify the problem in an important way
by assuming the correct string of phones. A real speech recognizer relies on
probabilistic estimators for each phone, so it is never sure about the identity
of any phone. We will relax this assumption in Chapter 7; for now, let’s look
at the simpler problem.

We’ll also begin with another simplification by assuming that we
already know where the word boundaries are. Later in the chapter, we’ll show
that we can simultaneously find word boundaries (‘segment’) and model
pronunciation variation.

Consider the particular problem of interpreting the sequence of phones
[ni], when it occurs after the word I at the beginning of a sentence. Stop and
see if you can think of any words which are likely to have been pronounced
[ni] before you read on. The word “Ni” is not allowed.

You probably thought of the word knee. This word is in fact pronounced
[ni]. But an investigation of the Switchboard corpus produces a
total of 7 words which can be pronounced [ni]! The seven words are the,
neat, need, new, knee, to, and you.

How can the word the be pronounced [ni]? The explanation for this
pronunciation (and all the others except the one for knee) lies in the
contextually-induced pronunciation variation we discussed in Chapter 4. For
example, we saw that [t] and [d] were often deleted word-finally, especially
before coronals; thus the pronunciation of neat as [ni] happened before the
word little (neat little → [niləl]). The pronunciation of the as [ni] is
caused by the regressive assimilation process also discussed in Chapter 4.
Recall that in nasal assimilation, phones before or after nasals take on nasal
manner of articulation. Thus [ð] can be realized as [n]. The many cases of the
pronounced as [ni] in Switchboard occurred after words like in, on, and been
(so in the → [inni]). The pronunciation of new as [ni] occurred most
frequently in the word New York; the vowel [u] has fronted to [i] before a [y].

The pronunciation of to as [ni] occurred after the word talking (talking
to you → [tɔkɪniyu]); here the [u] is palatalized by the following [y] and the
[n] is functioning jointly as the final sound of talking and the initial sound
of to. Because this phone is part of two separate words we will not try to
model this particular mapping; for the rest of this section let’s consider only
the following five words as candidate lexical forms for [ni]: knee, the, neat,
need, new.

We saw in the previous section that the Bayesian spelling error correction
algorithm had two components: candidate generation and candidate
scoring. Speech recognizers often use an alternative architecture, trading
off speed for storage. In this architecture, each pronunciation is expanded
in advance with all possible variants, which are then pre-stored with their
scores. Thus there is no need for candidate generation; the pronunciation [ni]
is simply stored with the list of words that can generate it. Let’s assume this
method and see how the prior and likelihood are computed for each word.

We will be choosing the word whose product of prior and likelihood is
the highest, according to Equation 5.12, where y represents the sequence of
phones (in this case [ni]) and w represents the candidate word (the, new, etc.).
The most likely word is then:

$$\hat{w} = \operatorname*{argmax}_{w \in W} \; \overbrace{P(y|w)}^{\text{likelihood}} \; \overbrace{P(w)}^{\text{prior}} \tag{5.12}$$

We could choose to generate the likelihoods P(y|w) by using a set of
confusion matrices as we did for spelling error correction. But it turns out
that confusion matrices don’t do as well for pronunciation as for spelling.
While misspelling tends to change the form of a word only slightly, the
changes in pronunciation between a lexical and surface form are much greater.
Confusion matrices only work well for single errors, which, as we saw above,
are common in misspelling. Furthermore, recall from Chapter 4 that
pronunciation variation is strongly affected by the surrounding phones, lexical
frequency, and stress and other prosodic factors. Thus probabilistic models
of pronunciation variation include a lot more factors than a simple confusion
matrix can include.

One simple way to generate pronunciation likelihoods is via probabilistic
rules. Probabilistic rules were first proposed for pronunciation by
Labov (1969) (who called them variable rules). The idea is to take the
rules of pronunciation variation we saw in Chapter 4 and associate them
with probabilities. We can then run these probabilistic rules over the lexicon
and generate different possible surface forms, each with its own probability.
For example, consider a simple version of a nasal assimilation rule which
explains why the can be pronounced [ni]; a word-initial [ð] becomes [n] if the
preceding word ended in [n] or sometimes [m]:

[.15]  ð ⇒ n / [+nasal] # ___    (5.13)

The [.15] to the left of the rule is the probability; this can be computed
from a large-enough labeled corpus such as the transcribed portion of
Switchboard. Let ncount be the number of times lexical [ð] is realized
word-initially by surface [n] when the previous word ends in a nasal (91 in the
Switchboard corpus). Let envcount be the total number of times lexical [ð]
occurs (whatever its surface realization) when the previous word ends in a
nasal (617 in the Switchboard corpus). The resulting probability is:

$$P(\eth \rightarrow n \,/\, [+\text{nasal}]\ \#\ \_\_) = \frac{\text{ncount}}{\text{envcount}} = \frac{91}{617} = .15$$
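
The relative-frequency estimate above is simple enough to check by hand, but a small sketch makes the computation explicit. The function name `rule_probability` is hypothetical; the two counts are the Switchboard figures quoted in the text.

```python
def rule_probability(ncount: int, envcount: int) -> float:
    """Relative-frequency estimate: times the rule applied / times it could apply."""
    if envcount <= 0:
        raise ValueError("envcount must be positive")
    return ncount / envcount


# Counts quoted in the text: lexical [dh] realized as [n] after a nasal (91)
# out of all occurrences of lexical [dh] after a nasal (617).
p_nasal_assimilation = rule_probability(ncount=91, envcount=617)
print(round(p_nasal_assimilation, 2))
```

The same function would apply to any other rule, given counts of how often the rule's environment occurs and how often the rule actually fires in it.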


We can build similar probabilistic versions of the assimilation and deletion
rules which account for the [ni] pronunciation of the other words.
Figure 5.10 shows sample rules and the probabilities trained on the Switchboard
pronunciation database.


Word   Rule Name            Rule                      P
the    nasal assimilation   ð ⇒ n / [+nasal] # ___    [.15]
neat   final t deletion     t ⇒ ∅ / V ___ #           [.52]
need   final d deletion     d ⇒ ∅ / V ___ #           [.11]
new    u fronting           u ⇒ i / ___ # [y]         [.36]

Figure 5.10  Simple rules of pronunciation variation due to context in
continuous speech accounting for the pronunciation of each of these words as [ni].

We now need to compute the prior probability P(w) for each word.
For spelling correction we did this by using the relative frequency of the
word in a large corpus; a word which occurred 44,000 times in 44 million
words receives the probability estimate 44,000/44,000,000 or .001. For the
pronunciation problem, let’s take our prior probabilities from a collection of
a written and a spoken corpus. The Brown Corpus is a 1 million word collection
of samples from 500 written texts from different genres (newspaper, novels,
non-fiction, academic, etc.) which was assembled at Brown University
in 1963-64 (Kucera and Francis, 1967; Francis, 1979; Francis and Kucera,
1982). The Switchboard Treebank corpus is a 1.4 million word collection
of telephone conversations. Together they let us sample from both the written
and spoken genres. The table below shows the probabilities for our five
words; each probability is computed from the raw frequencies by normalizing
by the number of words in the combined corpus (plus .5 * the number of
word types; so the total denominator is 2,486,075 + 30,836):

w      freq(w)    p(w)
knee   61         .000024
the    114,834    .046
neat   338        .00013
need   1417       .00056
new    2625       .001
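
As a rough check on the table, the priors can be recomputed from the raw frequencies. The denominator is the one given in the text; adding 0.5 to each numerator is an assumption (the usual companion of a denominator enlarged by 0.5 per word type), and the names `FREQ`, `DENOMINATOR`, and `prior` are illustrative only.

```python
# Raw frequencies from the combined Brown + Switchboard table.
FREQ = {"knee": 61, "the": 114_834, "neat": 338, "need": 1417, "new": 2625}

# Denominator quoted in the text: corpus tokens plus 0.5 per word type.
DENOMINATOR = 2_486_075 + 30_836


def prior(word: str, pseudo_count: float = 0.5) -> float:
    """Smoothed unigram estimate of P(w).

    Adding pseudo_count to the numerator is an assumption; the text only
    states how the denominator is formed.
    """
    return (FREQ[word] + pseudo_count) / DENOMINATOR


for w in FREQ:
    print(f"{w:5s} {prior(w):.2g}")
```

With or without the pseudo-count in the numerator, the values agree with the table to the two significant figures shown there.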

Now we are almost ready to answer our original question: what is
the most likely word given the pronunciation [ni] and given that the previous
word was I at the beginning of a sentence? Let’s start by multiplying together
our estimates for p(w) and p(y|w) to get an estimate; we show them sorted
from most probable to least probable (the has a probability of 0 since the
previous phone was not [n], and hence there is no other rule allowing [ð] to
be realized as [n]):


Word   p(y|w)   p(w)       p(y|w)p(w)
new    .36      .001       .00036
neat   .52      .00013     .000068
need   .11      .00056     .000062
knee   1.00     .000024    .000024
the    0        .046       0
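
The scoring step is just a product and a sort. The sketch below reproduces it with the likelihoods and priors from the tables in this section; the names `LIKELIHOOD`, `PRIOR`, and `rank_candidates` are hypothetical.

```python
# Likelihoods P(y|w) for y = [ni] and unigram priors P(w), as tabulated in this section.
LIKELIHOOD = {"new": 0.36, "neat": 0.52, "need": 0.11, "knee": 1.00, "the": 0.0}
PRIOR = {"new": 0.001, "neat": 0.00013, "need": 0.00056, "knee": 0.000024, "the": 0.046}


def rank_candidates(likelihood, prior):
    """Return (word, P(y|w) * P(w)) pairs sorted from most to least probable."""
    scores = {w: likelihood[w] * prior[w] for w in likelihood}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(LIKELIHOOD, PRIOR)
best_word = ranking[0][0]  # the argmax of Equation 5.12

for word, score in ranking:
    print(f"{word:5s} {score:.2g}")
```

Note that the word with the largest prior (the) ends up last, because its likelihood in this context is zero, while the word with the largest likelihood (knee) is outranked by more frequent words.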

Our algorithm suggests that new is the most likely underlying word.
But this is the wrong answer; the string [ni] following the word I came in
fact from the word need in the Switchboard corpus. One way that people
are able to correctly solve this task is word-level knowledge; people know
that the word string I need ... is much more likely than the word string I new
.... We don’t need to abandon our Bayesian model to handle this fact; we
just need to modify it so that our model also knows that I need is more likely
than I new. In Chapter 6 we will see that we can do this by using a slightly
more intelligent estimate of p(w) called a bigram estimate; essentially we
consider the probability of need following I instead of just the individual
probability of need.

This Bayesian algorithm is in fact part of all modern speech recognizers.
Where the algorithms differ strongly is in how they detect individual
phones in the acoustic signal, and in which search algorithm they use to
efficiently compute the Bayesian probabilities to find the proper string of
words in connected speech (as we will see in Chapter 7).

Decision Tree Models of Pronunciation Variation

In the previous section we saw how hand-written rules could be augmented
with probabilities to model pronunciation variation. Riley (1991) and
Withgott and Chen (1993) suggested an alternative to writing rules by hand,
which has proved quite useful: automatically inducing lexical-to-surface
pronunciation mappings from a labeled corpus with a decision tree,
particularly with the kind of decision tree called a Classification and Regression
Tree (CART) (Breiman et al., 1984). A decision tree takes a situation
described by a set of features and classifies it into a category and an associated
probability. For pronunciation, a decision tree can be trained to take a lexical
phone and various contextual features (surrounding phones, stress and syllable
structure information, perhaps lexical identity) and select an appropriate
surface phone to realize it. We can think of the confusion matrices we used
in spelling error correction above as degenerate decision trees; thus the
substitution matrix takes a lexical phone and outputs a probability distribution
over potential surface phones to be substituted. The advantage of decision
trees is that they can be automatically induced from a labeled corpus, and
that they are concise: decision trees pick out only the relevant features and
thus suffer less from sparseness than a matrix which has to condition on
every neighboring phone.

For example, Figure 5.11 shows a decision tree for the pronunciation
of the phoneme /t/ induced from the Switchboard corpus. While this tree
doesn’t include flapping (there is a separate tree for flapping) it does model
the fact that /t/ is more likely to be deleted before a consonant than before
a vowel. Note, in fact, that the tree automatically induced the classes Vowel
and Consonant. Furthermore, note that if /t/ is not deleted before a consonant,
it is likely to be unreleased. Finally, notice that /t/ is very unlikely to
be deleted in syllable onset position.

Readers with interest in decision tree modeling of pronunciation should
consult Riley (1991), Withgott and Chen (1993), and a textbook with an
introduction to decision trees such as Russell and Norvig (1995).

5.9  Weighted  Automata 

We said earlier that for purposes of efficiency a lexicon is often stored with
the most likely kinds of pronunciation variation pre-compiled. The two most
common representations for such a lexicon are the trie and the weighted
finite state automaton/transducer (or probabilistic FSA/FST) (Pereira
et al., 1994). We will leave the discussion of the trie to Chapter 7, and
concentrate here on the weighted automaton.

The weighted automaton is a simple augmentation of the finite automaton
in which each arc is associated with a probability, indicating how likely
that path is to be taken. The probabilities on all the arcs leaving a node must
sum to 1. Figure 5.12 shows two weighted automata for the word tomato,
adapted from Russell and Norvig (1995). The top automaton shows two
possible pronunciations, representing the dialect difference in the second vowel.
The bottom one shows more pronunciations (how many?) representing
optional reduction or deletion of the first vowel and optional flapping of the
final [t].
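
One possible way to hold such an automaton in memory is a dictionary of outgoing arcs per state. The sketch below follows the topology of the two-pronunciation tomato model; the 0.5/0.5 split between [ey] and [aa], the state names, and the helper names are illustrative assumptions rather than values read off the figure.

```python
import math

# Arcs of a weighted automaton: state -> {next_state: probability}.
# The even split at "m" is an illustrative assumption.
TOMATO = {
    "start": {"t1": 1.0},
    "t1": {"ow1": 1.0},
    "ow1": {"m": 1.0},
    "m": {"ey": 0.5, "aa": 0.5},
    "ey": {"t2": 1.0},
    "aa": {"t2": 1.0},
    "t2": {"ow2": 1.0},
    "ow2": {"end": 1.0},
    "end": {},
}


def is_well_formed(automaton):
    """Check that the arcs leaving every non-final state sum to 1."""
    return all(
        math.isclose(sum(arcs.values()), 1.0)
        for arcs in automaton.values()
        if arcs
    )


def paths(automaton, state="start", prob=1.0, prefix=()):
    """Yield (state_sequence, probability) for every start-to-end path.

    Only suitable for acyclic automata such as pronunciation networks.
    """
    prefix = prefix + (state,)
    if state == "end":
        yield prefix, prob
        return
    for nxt, p in automaton[state].items():
        yield from paths(automaton, nxt, prob * p, prefix)


pronunciations = list(paths(TOMATO))
print(is_well_formed(TOMATO), len(pronunciations))
```

Enumerating paths this way is also a direct answer to the "how many?" question for any acyclic pronunciation network: each start-to-end path is one pronunciation, and the path probabilities sum to 1 when the automaton is well formed.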

A Markov chain is a special case of a weighted automaton in which
the input sequence uniquely determines which states the automaton will go
through. Because it can’t represent inherently ambiguous problems, a Markov
chain is only useful for assigning probabilities to unambiguous sequences, and
hence isn’t often used in speech or language processing. In fact the weighted
automata used in speech and language processing can be shown to be equivalent
to Hidden Markov Models (HMMs). Why do we introduce weighted
automata in this chapter and HMMs in Chapter 7? The two models offer
a different metaphor; it is sometimes easier to think about certain problems
as weighted automata than as HMMs. The weighted automaton metaphor is
often applied when the input alphabet maps relatively neatly to the underlying
alphabet. For example, in the problem of correcting spelling errors in
typewritten input, the input sequence consists of letters and the states of the
automaton can correspond to letters. Thus it is natural to think of the problem
as transducing from a set of symbols to the same set of symbols with some
modifications, and hence weighted automata are naturally used for spelling
error correction. In the problem of correcting errors in hand-written input,
the input sequence is visual, and the input alphabet is an alphabet of lines and
angles and curves. Here instead of transducing from an alphabet to itself, we
need to do classification on some input sequence before considering it as
a sequence of states. Hidden Markov Models provide a more appropriate
metaphor, since they naturally handle separate alphabets for input sequences
and state sequences. But since any probabilistic automaton in which the input
sequence does not uniquely specify the state sequence can be modeled as
an HMM, the difference is one of metaphor rather than explanatory power.

[Figure 5.12 image not reproduced. Panel titles: "Word model with dialect
variation" (top); "Word model with coarticulation and dialect variation" (bottom).]

Figure 5.12  You say [t ow m ey t ow] and I say [t ow m aa t ow]. Two
pronunciation networks for the word tomato, adapted from Russell and Norvig
(1995). The top one models sociolinguistic variation (some British or eastern
American dialects); the bottom one adds in coarticulatory effects. Note the
correlation between allophonic and sociolinguistic variation; the dialect with
the vowel [ey] is more likely to flap than the other dialect.

Weighted automata can be created in many ways. One way, first proposed
by Cohen (1989), is to start with on-line pronunciation dictionaries and
use hand-written rules of the kind we saw above to create different potential
surface forms. The probabilities can then be assigned either by counting
the number of times each pronunciation occurs in a corpus, or, if the corpus
is too sparse, by learning probabilities for each rule and multiplying
out the rule probabilities for each surface form (Tajchman et al., 1995).
Finally, these weighted rules, or alternatively the decision trees we discussed
in the last section, can be automatically compiled into a weighted finite-state
transducer (Sproat and Riley, 1996). Alternatively, for very common words,
we can simply find enough examples of the pronunciation in a transcribed
corpus to build the model by just combining all the pronunciations into a
network (Wooters and Stolcke, 1994).

The networks for tomato above were shown merely as illustration and
are not from any real system; Figure 5.13 shows an automaton for the word
about which is trained on actual pronunciations from the Switchboard corpus
(we discussed these pronunciations in Chapter 4).


Computing  Likelihoods  from  Weighted  Automata:  The  Forward  Algorithm 

One advantage of an automaton-based lexicon is that there are efficient
algorithms for generating the probabilities that are needed to implement the
Bayesian method of correct-word identification of Section 5.8. These
algorithms apply to weighted automata and also to the Hidden Markov Models
that we will discuss in Chapter 7. Recall that in our example the Bayesian
method is given as input a series of phones [n iy], and must choose between
the words the, neat, need, new, and knee. This was done by computing two
probabilities: the prior probability of each word, and the likelihood of the
phone string [n iy] given each word. When we discussed this example earlier,
we said that, for example, the likelihood of [n iy] given the word need was
.11, since we computed a probability of .11 for the final-d-deletion rule from
our Switchboard corpus. This probability is transparent for need since there
were only two possible pronunciations ([n iy] and [n iy d]). But for words
like about, visualizing the different probabilities is more complex. Using a
precompiled weighted automaton can make it simpler to see all the different
probabilities of different paths through the automaton.

There is a very simple algorithm for computing the likelihood of a
string of phones given the weighted automaton for a word. This algorithm,
the forward algorithm, is an essential part of ASR systems, although in this
chapter we will only be working with a simple usage of the algorithm. This is
because the forward algorithm is particularly useful when there are multiple
paths through an automaton which can account for the input; this is not the
case in the weighted automata in this chapter, but will be true for the HMMs
of Chapter 7. The forward algorithm is also an important step in defining the
Viterbi algorithm which we will see later in this chapter.

Let’s begin by giving a formal definition of a weighted automaton and
of the input and output to the likelihood computation problem. A weighted
automaton consists of

1. a sequence of states $q = (q_0 q_1 q_2 \ldots q_n)$, each corresponding to a phone,

2. a set of transition probabilities between states, $a_{01}, a_{12}, a_{13}$, encoding
the probability of one phone following another.

We represent the states as nodes, and the transition probabilities as
edges between nodes; an edge exists between two nodes if there is a non-zero
transition probability between the two nodes.⁴ The sequences of symbols
that are input to the model (if we are thinking of it as a recognizer) or which are
produced by the model (if we are thinking of it as a generator) are generally
called the observation sequence, referred to as $O = (o_1 o_2 o_3 \ldots o_t)$.
(Upper-case letters are used for a sequence and lower-case letters for an individual
element of a sequence.) We will use this terminology when talking about
weighted automata and later when talking about HMMs.

4 We have used two ‘special’ states (often called non-emitting states) as the start and end
state; it is also possible to avoid the use of these states. In that case, an automaton must
specify two more things:

1. $\pi$, an initial probability distribution over states, such that $\pi_i$ is the probability that the
automaton will start in state i. Of course some states j may have $\pi_j = 0$, meaning that
they cannot be initial states.

2. a set of legal accepting states.

Figure 5.14 shows an automaton for the word need with a sample
observation sequence.


[Figure 5.14 image not reproduced. Recoverable labels: Word Model, with
$a_{24} = .11$; Observation Sequence (phone symbols) $o_1\ o_2\ o_3$.]

Figure 5.14  A simple weighted automaton or Markov chain pronunciation
network for the word need, showing the transition probabilities, and a sample
observation sequence. The transition probabilities $a_{xy}$ between two states x
and y are 1.0 unless otherwise specified.


This task of determining which underlying word might have produced
an observation sequence is called the decoding problem. Recall that in order
to find which of the candidate words was most probable given the observation
sequence [n iy], we need to compute the product P(O|w)P(w) for each
candidate word (the, need, neat, knee, new), i.e. the likelihood of the
observation sequence O given the word w times the prior probability of the
word.

The forward algorithm can be run to perform this computation for each
word; we give it an observation sequence and the pronunciation automaton
for a word and it will return P(O|w)P(w). Thus one way to solve the
decoding problem is to run the forward algorithm separately on each word and
choose the word with the highest value. As we saw earlier, the Bayesian
method produces the wrong result for pronunciation [n iy] as part of the
word sequence I need (its first choice is the word new, and the second choice
is neat; need is only the third choice). Since the forward algorithm is just
a way of implementing the Bayesian approach, it will return the exact same
rankings. (We will see in Chapter 6 how to augment the algorithm with
bigram probabilities which will enable it to make use of the knowledge that
the previous word was I.)

The forward algorithm takes as input a pronunciation network for each
candidate word. Because the word the only has the pronunciation [n iy] after
nasals, and since we are assuming the actual context of this word was after
the word I (no nasal), we will skip that word and look only at new, neat,
need, and knee. Note in Figure 5.15 that we have augmented each network
with the probability of each word, computed from the frequencies that we saw
earlier in this section.


[Figure 5.15 image not reproduced. Recoverable labels: Word model for "need"
(arc probability .11), Word model for "knee", Word model for "neat" (arc
probability .52), Word model for "new".]

Figure 5.15  Pronunciation networks for the words need, neat, new, and
knee. All networks are simplified from the actual pronunciations in the
Switchboard corpus. Each network has been augmented by the unigram probability
of the word (i.e. its normalized frequency from the Switchboard+Brown
corpus). Word probabilities are not usually included as part of the pronunciation
network for a word; they are added here to simplify the exposition of the
forward algorithm.


The forward algorithm is another dynamic programming algorithm,
and can be thought of as a slight generalization of the minimum edit distance
algorithm. Like the minimum edit distance algorithm, it uses a table
to store intermediate values as it builds up the probability of the observation
sequence. Unlike the minimum edit distance algorithm, the rows are
labeled not just by states which always occur in linear order, but implicitly
by a state-graph which has many ways of getting from one state to another.
In the minimum edit distance algorithm, we filled in the matrix by just
computing the value of each cell from the 3 cells around it. With the forward
algorithm, on the other hand, a state might be entered from any other state,
and so the recurrence relation is somewhat more complicated. Furthermore,
the forward algorithm computes the sum of the probabilities of all possible
paths which could generate the observation sequence, where the minimum
edit distance algorithm computed the minimum over such paths.⁵ Each cell of
the forward algorithm matrix, forward[t, j], represents the probability of being in
state j after seeing the first t observations, given the automaton λ. Since
we have augmented our graphs with the word probability p(w), our example
of the forward algorithm here is actually computing this likelihood times
p(w). The value of each cell forward[t, j] is computed by summing over the
probabilities of every path that could lead us to this cell. Formally, each cell
expresses the following probability:


$$\text{forward}[t, j] = P(o_1, o_2 \ldots o_t,\ q_t = j \mid \lambda)\ P(w) \tag{5.14}$$


Here $q_t = j$ means ‘the t-th state in the sequence of states is state j’.
We compute this probability by summing over the extensions of all the paths
that lead to the current cell. An extension of a path
from a state i at time t − 1 is computed by multiplying the following three
factors:

1. the previous path probability from the previous cell forward[t − 1, i],

2. the transition probability $a_{ij}$ from previous state i to current state j,

3. the observation likelihood $b_{jt}$ that current state j matches observation
symbol t. For the weighted automata that we consider here, $b_{jt}$ is 1 if
the observation symbol matches the state, and 0 otherwise. Chapter 7
will consider more complex observation likelihoods.


The algorithm is described in Figure 5.16.

Figure 5.17 shows the forward algorithm applied to the word need. The
algorithm applies similarly to the other words which can produce the string
[n iy], resulting in the probabilities tabulated earlier. In order to compute the
most probable underlying word, we run the forward algorithm separately on
each of the candidate words, and choose the one with the highest probability.
Chapter 7 will give further details of the mathematics of the forward
algorithm and introduce the related forward-backward algorithm.


5 The  forward  algorithm  computes  the  sum  because  there  may  be  multiple  paths  through 
the  network  which  explain  a given  observation  sequence.  Chapter  7 will  take  up  this  point  in 
more  detail. 



Chapter  5.  Probabilistic  Models  of  Pronunciation  and  Spelling 


function FORWARD(observations, state-graph) returns forward-probability

  num-states ← NUM-OF-STATES(state-graph)
  num-obs ← length(observations)
  Create probability matrix forward[num-states + 2, num-obs + 2], initialized to 0
  forward[0,0] ← 1.0
  for each time step t from 0 to num-obs do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        forward[s', t+1] ← forward[s', t+1] + forward[s, t] * a[s, s'] * b[s', o_t]
  return the sum of the probabilities in the final column of forward


Figure 5.16  The forward algorithm for computing likelihood of observation
sequence given a word model. a[s, s'] is the transition probability from
current state s to next state s' and b[s', o_t] is the observation likelihood of s'
given o_t. For the weighted automata that we consider here, b[s', o_t] is 1 if the
observation symbol matches the state, and 0 otherwise.
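
A runnable counterpart to the pseudocode may help make the recurrence concrete. The following is a minimal sketch under stated assumptions, not the figure's matrix formulation: it keeps one dictionary of state probabilities per time step instead of a full matrix, treats `start` and `end` as non-emitting states, and uses 0/1 observation likelihoods. The names `forward`, `NEED_ARCS`, and `NEED_PHONES` are hypothetical, and the .89 arc is simply the complement of the .11 deletion probability.

```python
from collections import defaultdict


def forward(observations, arcs, phone_of, start="start", end="end"):
    """Sum the probabilities of all start-to-end paths that emit `observations`.

    arcs:      {state: {next_state: transition probability}}
    phone_of:  {state: phone emitted by that state}; start and end emit nothing
    """
    # alpha[s] = probability of being in state s after the observations seen so far
    alpha = {start: 1.0}
    for obs in observations:
        nxt = defaultdict(float)
        for s, p in alpha.items():
            for s2, a in arcs.get(s, {}).items():
                # b is 1 when the state's phone matches the observation, else 0
                b = 1.0 if phone_of.get(s2) == obs else 0.0
                nxt[s2] += p * a * b  # sum over all paths entering s2
        alpha = nxt
    # Finish with one transition into the non-emitting end state.
    return sum(p * arcs.get(s, {}).get(end, 0.0) for s, p in alpha.items())


# Word model for "need", with the unigram prior .00056 placed on the start arc.
# The .89 arc is assumed as the complement of the .11 final-d-deletion arc.
NEED_ARCS = {
    "start": {"n": 0.00056},
    "n": {"iy": 1.0},
    "iy": {"d": 0.89, "end": 0.11},
    "d": {"end": 1.0},
}
NEED_PHONES = {"n": "n", "iy": "iy", "d": "d"}

p_need = forward(["n", "iy"], NEED_ARCS, NEED_PHONES)
print(f"{p_need:.2g}")
```

Because only one path through this network matches [n iy], the sum has a single non-zero term, .00056 × .11; the summation only starts to matter when several paths can explain the same observations.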

Figure 5.17  The forward algorithm applied to the word need, computing
the probability P(O|w)P(w). While this example doesn’t require the full power
of the forward algorithm, we will see its use on more complex examples in
Chapter 7.


Decoding:  The  Viterbi  Algorithm 

The forward algorithm as we presented it seems a bit of an overkill. Since
only one path through the pronunciation networks will match the input string,
why use such a big matrix and consider so many possible paths? Furthermore,
as a decoding method, it seems rather inefficient to run the forward
algorithm once for each word (imagine how inefficient this would be if we
were computing likelihoods for all possible sentences rather than all possible
words!). Part of the reason that the forward algorithm seems like overkill is
that we have immensely simplified the pronunciation problem by assuming
that our input consists of sequences of unambiguous symbols. We will see in
Chapter 7 that when the observation sequence is a set of noisy acoustic values,
there are many possible paths through the automaton, and the forward
algorithm will play an important role in summing these paths.

But it is true that having to run it separately on each word makes the
forward algorithm a very inefficient decoding method. Luckily, there is a
simple variation on the forward algorithm called the Viterbi algorithm which
allows us to consider all the words simultaneously and still compute the most
likely path. The term Viterbi is common in speech and language processing,
but like the forward algorithm this is really a standard application of
the classic dynamic programming algorithm, and again looks a lot like the
minimum edit distance algorithm. The Viterbi algorithm was first applied
to speech recognition by Vintsyuk (1968), but has what Kruskal (1983) calls
a ‘remarkable history of multiple independent discovery and publication’;
see the History section at the end of the chapter for more details. The name
Viterbi is the one which is most commonly used in speech recognition,
although the terms DP alignment (for Dynamic Programming alignment),
dynamic time warping, and one-pass decoding are also commonly used.
The term is applied to the decoding algorithm for weighted automata and
Hidden Markov Models on a single word and also to its more complex
application to continuous speech, as we will see in Chapter 7. In this chapter
we will show how the algorithm is used to find the best path through a
network composed of single words, as a result choosing the word which is most
probable given the observation sequence.

The version of the Viterbi algorithm that we will present takes as input
a single weighted automaton and a sequence of observed phones
$o = (o_1 o_2 o_3 \ldots o_t)$ and returns the most probable state sequence
$q = (q_1 q_2 q_3 \ldots q_t)$, together with its probability. We can create a
single weighted automaton by combining the pronunciation networks for the
four words in parallel with a single start and a single end state.
Figure 5.18 shows the combined network.

Figure 5.19 shows pseudocode for the Viterbi algorithm. Like the minimum
edit distance and forward algorithms, the Viterbi algorithm sets up a
probability matrix, with one column for each time index t and one row for
each state in the state graph. Also like the forward algorithm, each column
has a cell for each state $q_i$ in the single combined automaton for the four
words. In fact, the code for the Viterbi algorithm should look exactly like
the code for the forward algorithm with two modifications. First, where the
forward algorithm places the sum of all previous paths into the current cell,
the Viterbi algorithm puts the max of the previous paths into the current cell.
Second, the Viterbi algorithm keeps in each cell a backpointer to the previous
state on that best path, so that the path itself can be recovered by a
backtrace.

The algorithm first creates N + 2 state columns, where N is the number of
observed phones; here N = 2, giving four columns. The first column is an
initial pseudo-observation, the second corresponds to the first observation
phone [n], the third to [iy], and the fourth to a final pseudo-observation.
We begin in the first column by setting the probability of the start state
to 1.0, and the other probabilities to 0; the reader should find this in
Figure 5.20. Cells with probability 0 are simply left blank for readability.

Then we move to the next column; as with the forward algorithm, for
every state in column 0, we compute the probability of moving into each
state in column 1. The value viterbi[t, j] is computed by taking the maximum
over the extensions of all the paths that lead to the current cell. An extension
of a path from a state i at time t − 1 is computed by multiplying the same
three factors we used for the forward algorithm:

1. the previous path probability from the previous cell, viterbi[t − 1, i].

2. the transition probability $a_{ij}$ from previous state i to current state j.

3. the observation likelihood $b_{jt}$ that current state j matches observation
symbol t. For the weighted automata that we consider here, $b_{jt}$ is 1 if
the observation symbol matches the state, and 0 otherwise. Chapter 7
will consider more complex observation likelihoods.
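
The three factors listed above combine into a single recurrence. As a sketch in the notation of this section (the array names viterbi and forward follow the text; writing the two recurrences side by side is our own presentation), the only difference between the two algorithms at this step is the max in place of the sum:

```latex
\begin{align*}
\mathit{viterbi}[t, j] &= \max_{i} \; \mathit{viterbi}[t-1, i] \; a_{ij} \; b_{jt} \\
\mathit{forward}[t, j] &= \sum_{i} \; \mathit{forward}[t-1, i] \; a_{ij} \; b_{jt}
\end{align*}
```

Here $a_{ij}$ is the transition probability from state i to state j, and $b_{jt}$ is the likelihood that state j matches the observation at time t.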




function VITERBI(observations of len T, state-graph) returns best-path

  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score ← viterbi[s, t] * a[s, s'] * b_s'(o_t)
        if ((viterbi[s', t+1] = 0) || (new-score > viterbi[s', t+1]))
        then
          viterbi[s', t+1] ← new-score
          back-pointer[s', t+1] ← s
  Backtrace from highest probability state in the final column of viterbi[]
  and return path


Figure 5.19 Viterbi algorithm for finding optimal sequence of states in
continuous speech recognition, simplified by using phones as inputs. Given an
observation sequence of phones and a weighted automaton (state graph), the
algorithm returns the path through the automaton which has maximum
probability and accepts the observation sequence. a[s, s'] is the transition
probability from current state s to next state s' and b[s', o_t] is the
observation likelihood of s' given o_t. For the weighted automata that we
consider here, b[s', o_t] is 1 if the observation symbol matches the state,
and 0 otherwise.
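
As an illustration, the pseudocode in Figure 5.19 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the book's implementation: the function name `viterbi`, the dictionary-based representation of the state graph, and the state names are all invented for the example. The weights are illustrative only; they were chosen to agree with the numbers quoted in the surrounding text (a .11 transition from need to the end state, 1.0 for new, and a final probability of .00036), and the remaining values are hypothetical.

```python
def viterbi(observations, transitions, symbol, start="start", end="end"):
    """Return (best_path, probability) through a weighted automaton.

    transitions: dict mapping state -> list of (next_state, probability)
    symbol:      dict mapping each emitting state -> the phone it matches
    """
    T = len(observations)
    # Column 0 is the start pseudo-observation, columns 1..T are the
    # observed phones, and column T+1 is the end pseudo-observation.
    columns = [dict() for _ in range(T + 2)]
    backptr = [dict() for _ in range(T + 2)]
    columns[0][start] = 1.0

    for t in range(T + 1):
        for s, score in columns[t].items():
            for s_next, a in transitions.get(s, []):
                if t < T:
                    # b is 1 if the state matches the observation, else 0
                    b = 1.0 if symbol.get(s_next) == observations[t] else 0.0
                else:
                    # after the last phone, only the end state may be entered
                    b = 1.0 if s_next == end else 0.0
                new_score = score * a * b
                # keep the max (not the sum) and remember where it came from
                if new_score > columns[t + 1].get(s_next, 0.0):
                    columns[t + 1][s_next] = new_score
                    backptr[t + 1][s_next] = s

    if end not in columns[T + 1]:
        return None, 0.0          # no path accepts the observations

    # Backtrace from the end state in the final column.
    path, state = [], end
    for t in range(T + 1, 0, -1):
        path.append(state)
        state = backptr[t][state]
    path.append(state)
    path.reverse()
    return path, columns[T + 1][end]


# Hypothetical two-word network (new, need); weights are illustrative.
TRANSITIONS = {
    "start":   [("n_new", 0.001), ("n_need", 0.00056)],
    "n_new":   [("iy_new", 0.36), ("uw_new", 0.64)],
    "iy_new":  [("end", 1.0)],
    "uw_new":  [("end", 1.0)],
    "n_need":  [("iy_need", 1.0)],
    "iy_need": [("end", 0.11), ("d_need", 0.89)],
    "d_need":  [("end", 1.0)],
}
SYMBOL = {
    "n_new": "n", "iy_new": "iy", "uw_new": "uw",
    "n_need": "n", "iy_need": "iy", "d_need": "d",
}

best_path, best_prob = viterbi(["n", "iy"], TRANSITIONS, SYMBOL)
print(best_path, best_prob)
```

Storing each column as a dictionary that holds only non-zero cells mirrors the convention of leaving zero-probability cells blank; a dense matrix indexed by state number, as in the pseudocode, would work equally well.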


In Figure 5.20, in the column for the input n, each word starts with [n],
and so each has a non-zero probability in the cell for the state n. Other cells
in that column have zero entries, since their states don’t match n. When we
proceed to the next column, each cell that matches iy gets updated with the
contents of the previous cell times the transition probability to that cell. Thus
the value of viterbi[2, iy_new] for the iy state of the word new is the product of
the ‘word’ probability of new times the probability of new being pronounced
with the vowel iy. Notice that if we look only at this iy column, the word
need is currently the ‘most-probable’ word. But when we move to the final
column, the word new will win out, since need has a smaller transition
probability to end (.11) than new does (1.0). We can now follow the backpointers
and backtrace to find the path that gave us this final probability of .00036.

Figure 5.20 The entries in the individual state columns for the Viterbi
algorithm. Each cell keeps the probability of the best path so far and a pointer
to the previous cell along that path. Backtracing from the end state, we can
reconstruct the state sequence n_new iy_new, arriving at the best word new.


Weighted  Automata  and  Segmentation 

Weighted automata and the Viterbi algorithm play an important role in various
algorithms for segmentation. Segmentation is the process of taking an
undifferentiated sequence of symbols and ‘segmenting’ it into chunks. For
example, sentence segmentation is the problem of automatically finding the
sentence boundaries in a corpus. Similarly, word segmentation is the problem
of finding word boundaries in a corpus. In written English there is no
difficulty in segmenting words from each other because there are orthographic
spaces between words. This is not the case in languages like Chinese and
Japanese that use a Chinese-derived writing system. Written Chinese does
not mark word boundaries. Instead, each Chinese character is written one
after the other without spaces. Since each character approximately
represents a single morpheme, and since words can be composed of one or more
characters, it is often difficult to know where words should be segmented.
Proper word segmentation is necessary for many applications, particularly
parsing and text-to-speech (how a sentence is broken up into words
influences its pronunciation in a number of ways).

Consider the following example sentence from Sproat et al. (1996):

(5.15)  日文章魚怎麼說?
        “How do you say ‘octopus’ in Japanese?”

This sentence has two potential segmentations, only one of which is
correct. In the plausible segmentation, the first two characters are combined
to make the word for ‘Japanese language’ (日文 rì-wén) (the accents indicate
the tone of each syllable), and the next two are combined to make the word
for ‘octopus’ (章魚 zhāng-yú).

(5.16)  日文      章魚      怎麼    說    ?
        rì-wén    zhāng-yú  zěn-me  shuō
        Japanese  octopus   how     say
        ‘How do you say octopus in Japanese?’

(5.17)  日     文章       魚    怎麼    說    ?
        rì     wén-zhāng  yú    zěn-me  shuō
        Japan  essay      fish  how     say
        ‘How do you say Japan essay fish?’

Sproat  et  al.  (1996)  give  a very  simple  algorithm  which  selects  the 
correct  segmentation  by  choosing  the  one  which  contains  the  most-frequent 
words.  In  other  words,  the  algorithm  multiplies  together  the  probabilities  of 
each  word  in  a potential  segmentation  and  chooses  whichever  segmentation 
results  in  a higher  product  probability. 

The implementation of their algorithm combines a weighted finite-state
transducer representation of a Chinese lexicon with the Viterbi algorithm.
This lexicon is a slight augmentation of the FST lexicons we saw
in Chapter 4; each word is represented as a series of arcs representing each
character in the word, followed by a weighted arc representing the
probability of the word. As is commonly true with probabilistic algorithms, they
actually use the negative log probability of the word ($-\log P(w)$). The log
probability is mainly useful because the product of many probabilities gets
very small, and so using the log probability can help avoid underflow. Using
log probabilities also means that we are adding costs rather than multiplying
probabilities, and that we are looking for the minimum cost solution rather
than the maximum probability solution.
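
The underflow problem is easy to reproduce. The following is a small sketch with made-up numbers (one hundred hypothetical word probabilities of 0.00001 each; the helper names `product` and `cost` are our own): the direct product collapses to zero in double-precision arithmetic, while the sum of negative log probabilities stays in a comfortable range.

```python
import math


def product(probs):
    """Multiply probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def cost(probs):
    """Sum of negative log probabilities; lower cost means higher probability."""
    return sum(-math.log(p) for p in probs)


probs = [1e-5] * 100       # hypothetical word probabilities
print(product(probs))      # underflows to 0.0
print(cost(probs))         # roughly 1151.29
```

Because the logarithm is monotonic, the candidate with the highest product probability is also the candidate with the lowest total cost, so nothing is lost by switching from maximizing a product to minimizing a sum.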

Consider the example in Figure 5.21. The sample lexicon in Figure 5.21(a)
consists of only 5 potential words:

Word    Pronunciation   Meaning       Cost (−log p(w))
日文    rì-wén          ‘Japanese’    10.63
日      rì              ‘Japan’        6.51
章魚    zhāng-yú        ‘octopus’     13.18
文章    wén-zhāng       ‘essay’        9.51
魚      yú              ‘fish’        10.28

The system represents the input sentence as the unweighted FSA I in
Figure 5.21(b). In order to compose this input with the lexicon, it needs to
be converted into an FST. The algorithm uses a function Id which takes an
FSA A and returns the FST which maps all and only the strings accepted
by A to themselves. Let D be the dictionary (the weighted lexicon) and let
D* represent the transitive closure of D, i.e., the automaton created by
adding a loop from the end of the lexicon back to the beginning. The set of
all possible segmentations is $\mathrm{Id}(I) \circ D^*$, i.e., the input
transducer Id(I) composed with the transitive closure of the dictionary
D, shown in Figure 5.21(c). Then the best segmentation is the lowest-cost
segmentation in $\mathrm{Id}(I) \circ D^*$, shown in Figure 5.21(d).

Finding  the  best  path  shown  in  Figure  5.21(d)  can  be  done  easily  with 
the  Viterbi  algorithm  and  is  left  as  an  exercise  for  the  reader. 
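
For readers who want to check their answer, here is one possible sketch of the search. It is not the transducer-composition implementation of Sproat et al. (1996); it is a plain dynamic-programming search over split points that uses the same costs, and it assumes every word in a segmentation must appear in the lexicon. The names `LEXICON` and `segment` are our own, and syllables are written without tone marks and stand in for the characters.

```python
# Costs (-log p(w)) copied from the sample lexicon; each word is a tuple of syllables.
LEXICON = {
    ("ri", "wen"): 10.63,      # 'Japanese'
    ("ri",): 6.51,             # 'Japan'
    ("zhang", "yu"): 13.18,    # 'octopus'
    ("wen", "zhang"): 9.51,    # 'essay'
    ("yu",): 10.28,            # 'fish'
}


def segment(syllables, lexicon):
    """Return (words, total_cost) for the lowest-cost segmentation,
    or (None, inf) if the input cannot be covered by lexicon words."""
    n = len(syllables)
    inf = float("inf")
    best = [inf] * (n + 1)     # best[i]: lowest cost of segmenting syllables[:i]
    back = [None] * (n + 1)    # back[i]: start index of the last word used
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = tuple(syllables[start:end])
            if word in lexicon and best[start] + lexicon[word] < best[end]:
                best[end] = best[start] + lexicon[word]   # min instead of max
                back[end] = start
    if best[n] == inf:
        return None, inf
    # Follow the backpointers from the end of the input.
    words, i = [], n
    while i > 0:
        words.append(tuple(syllables[back[i]:i]))
        i = back[i]
    words.reverse()
    return words, best[n]


words, total = segment(["ri", "wen", "zhang", "yu"], LEXICON)
print(words, round(total, 2))
```

On the example input the two complete segmentations cost 10.63 + 13.18 = 23.81 (ri-wen, zhang-yu) and 6.51 + 9.51 + 10.28 = 26.30 (ri, wen-zhang, yu), so the search returns the first.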

This segmentation algorithm, like the spelling error correction algorithm
we saw earlier, can also be extended to incorporate the cross-word
probabilities (N-gram probabilities) that will be introduced in Chapter 6.


[Figure 5.21: four panels. (a) Dictionary D. (b) Input I. (c) Id(I) ∘ D*.
(d) BestPath(Id(I) ∘ D*), whose arcs read ri/0.000, wen/0.000, ε:ε/10.63,
zhang/0.000, yu/0.000, ε:ε/13.18.]

Figure 5.21 The Sproat et al. (1996) algorithm applied to four input words
(after Sproat et al. (1996)).


5.10 Pronunciation in Humans

Section 5.7 discussed many factors which influence pronunciation variation
in humans. In this section we very briefly summarize a computational model
of the retrieval of words from the mental lexicon as part of human lexical
production. The model is due to Gary Dell and his colleagues; for brevity
we combine and simplify features of multiple models (Dell, 1986, 1988;
Dell et al., 1997) in this single overview. First consider some data. As
we suggested in Chapter 3, production errors such as slips of the tongue
(darn bore instead of barn door) often provide important insights into lexical
production. Dell (1986) summarizes a number of previous results about such
slips. The lexical bias effect is that slips are more likely to create words than
non-words; thus slips like dean bad → bean dad are three times more likely
than slips like deal back → beal dack. The repeated-phoneme bias is that
two phones in two words are likely to participate in an error if there is an
identical phone in both words. Thus deal beack is more likely to slip to beal
than deal back is.

The model which Dell (1986, 1988) proposes is a network with 3 levels:
semantics, word (lemma), and phonemes.⁶ The semantics level has
nodes for concepts, the lemma level has one node for each word, and the
phoneme level has separate nodes for each phone, separated into onsets,
vowels, and codas. Each lemma node is connected to the phoneme units
which comprise the word, and the semantic units which represent the
concept. Connections are used to pass activation from node to node, and are
bidirectional and excitatory. Lexical production happens in two stages. In
the first stage, activation passes from the semantic concepts to words.
Activation will cascade down into the phonological units and then back up into
other word units. At some point the most highly activated word is selected.
In the second stage, this selected word is given a large jolt of activation. Again
this activation passes to the phonological level. Now the most highly active
phoneme nodes are selected and accessed in order.

Figure 5.22 shows Dell’s model. Errors occur because too much
activation reaches the wrong phonological node. Lexical bias, for example, is
modeled by activation spreading up from the phones of the intended word to
neighboring words, which then activate their own phones. Thus incorrect
phones get ‘extra’ activation if they are present in actual words.

The two-step network model also explains other facts about lexical
production. Aphasic speakers have various troubles in language production
and comprehension, often caused by strokes or accidents. Dell et al. (1997)
show that weakening various connections in a network model like the one
above can also account for the speech errors in aphasics. This supports the
continuity hypothesis, which suggests that some part of aphasia is merely an
extension of normal difficulties in word retrieval, and also provides further
evidence for the network model. Readers interested in details of the model
should see the above references and related computational models such as
Roelofs (1997), which extends the network model to deal with
syllabification, phonetic encoding, and more complex sequential structure, and
Levelt et al. (1999).


⁶ Dell (1988) also has a fourth level for syllable structure which we will ignore here.

[Figure 5.22: network diagram with three layers of nodes: Semantics; Words
(Lemmas); and phonemes grouped into Onsets, Vowels, and Codas.]

Figure 5.22 The network model of Dell (1986, 1988), showing the
mechanism for lexical bias (modified from Dell (1988, p. 134)). The boldfaced
nodes indicate nodes with lots of activation. The intended word dean has a
greater chance of slipping to bean because of the existence of the bean node.
The boldfaced lines show the connections which account for the possible slip.


5.11  Summary 


This chapter has introduced some essential metaphors and algorithms that
will be useful throughout speech and language processing. The main points
are as follows:

• We can represent many language problems as if a clean string of
symbols had been corrupted by passing through a noisy channel and it is
our job to recover the original symbol string. One powerful way to
recover the original symbol string is to consider all possible original
strings, and rank them by their conditional probability.

• The conditional probability is usually easiest to compute using the
Bayes Rule, which breaks down the probability into a prior and a
likelihood. For spelling error correction or pronunciation-modeling,
the prior is computed by taking word frequencies or word bigram
frequencies. The likelihood is computed by training a simple probabilistic
model (like a confusion matrix, a decision tree, or a hand-written rule)
on a database.

• The task of computing the distance between two strings comes up
in spelling error correction and other problems. The minimum edit
distance algorithm is an application of the dynamic programming
paradigm to solving this problem, and can be used to produce the
distance between two strings or an alignment of the two strings.

• The pronunciation of words is very variable. Pronunciation variation
is caused by two classes of factors: lexical variation and allophonic
variation. Lexical variation includes sociolinguistic factors like
dialect and register or style.

• The single most important factor affecting allophonic variation is the
identity of the surrounding phones. Other important factors include
syllable structure, stress patterns, and the identity and frequency of the
word.

• The decoding task is the problem of determining the correct
‘underlying’ sequence of symbols that generated the ‘noisy’ sequence
of observation symbols.

• The forward algorithm is an efficient way of computing the
likelihood of an observation sequence given a weighted automaton. Like the
minimum edit distance algorithm, it is a variant of dynamic
programming. It will prove particularly useful in Chapter 7 when we consider
Hidden Markov Models, since it will allow us to sum multiple paths that each
account for the same observation sequence.

• The  Viterbi  algorithm,  another  variant  of  dynamic  programming,  is 
an  efficient  way  of  solving  the  decoding  problem  by  considering  all 
possible  strings  and  using  the  Bayes  Rule  to  compute  their  probabilities 
of  generating  the  observed  ‘noisy’  sequence. 

• Word  segmentation  in  languages  without  word-boundary  markers, 
like  Chinese  and  Japanese,  is  another  kind  of  optimization  task  which 
can  be  solved  by  the  Viterbi  algorithm. 


Bibliographical  and  Historical  Notes 

Algorithms for spelling error detection and correction have existed since
at least Blair (1960). Most early algorithms were based on similarity keys
like the Soundex algorithm discussed in the exercises on page 89 (Odell and
Russell, 1922; Knuth, 1973). Damerau (1964) gave a dictionary-based
algorithm for error detection; most error-detection algorithms since then have
been based on dictionaries. Damerau also gave a correction algorithm that
worked for single errors. Most algorithms since then have relied on dynamic
programming, beginning with Wagner and Fischer (1974) (see below).
Kukich (1992) is the definitive survey article on spelling error detection and
correction. Only much later did probabilistic algorithms come into vogue
for non-OCR spelling-error correction (for example Kashyap and Oommen
(1983) and Kernighan et al. (1990)).

By contrast, the field of optical character recognition developed
probabilistic algorithms quite early; Bledsoe and Browning (1959) developed a
probabilistic approach to OCR spelling error correction that used a large
dictionary and computed the likelihood of each observed letter sequence given
each word in the dictionary by multiplying the likelihoods for each letter.
In this sense Bledsoe and Browning also prefigured the modern Bayesian
approaches to speech recognition. Shinghal and Toussaint (1979) and Hull
and Srihari (1982) applied bigram letter-transition probabilities and the Viterbi
algorithm to choose the most likely correct form for a misspelled OCR input.

The application of dynamic programming to the problem of sequence
comparison has what Kruskal (1983) calls a ‘remarkable history of multiple
independent discovery and publication’. Kruskal and others give at least the
following independently-discovered variants of the algorithm published in
four separate fields:


Citation                       Field
Viterbi (1967)                 information theory
Vintsyuk (1968)                speech processing
Needleman and Wunsch (1970)    molecular biology
Sakoe and Chiba (1971)         speech processing
Sankoff (1972)                 molecular biology
Reichert et al. (1973)         molecular biology
Wagner and Fischer (1974)      computer science


To the extent that there is any standard to terminology in speech and
language processing, it is the use of the term Viterbi for the application of
dynamic programming to any kind of probabilistic maximization problem.
For non-probabilistic problems, the plain term dynamic programming is
often used. The history of the forward algorithm, which derives from
Hidden Markov Models, will be summarized in Chapter 7. Sankoff and Kruskal
(1983) is a collection exploring the theory and use of sequence comparison
in different fields. Forney (1973) is an early survey paper which explores the
origin of the Viterbi algorithm in the context of information and
communications theory.

The weighted finite-state automaton was first described by Pereira et al.
(1994), drawing from a combination of work in finite-state transducers and
work in probabilistic languages (Booth and Thompson, 1973).


Exercises

5.1 Computing minimum edit distances by hand, figure out whether drive
is closer to brief or to divers, and what the edit distance is. You may use any
version of distance that you like.

5.2 Now implement a minimum edit distance algorithm and use your
hand-computed results to check your code.

5.3 The Viterbi algorithm can be used to extend a simplified version of
the Kernighan et al. (1990) spelling error correction algorithm. Recall that
the Kernighan et al. (1990) algorithm only allowed a single spelling error
for each potential correction. Let’s simplify by assuming that we only have
three confusion matrices instead of four (del, ins, and sub; no trans). Now
show how the Viterbi algorithm can be used to extend the Kernighan et al.
(1990) algorithm to handle multiple spelling errors per word.

5.4 To attune your ears to pronunciation reduction, listen for the
pronunciation of the word the, a, or to in the spoken language around you. Try to
notice when it is reduced, and mark down whatever facts about the speaker
or speech situation that you can. What are your observations?

5.5 Find a speaker of a different dialect of English than your own (even
someone from a slightly different region of your native dialect) and
transcribe (using the ARPAbet or IPA) 10 words that they pronounce differently
than you. Can you spot any generalizations?

5.6  Implement  the  Forward  algorithm. 

5.7 Write a modified version of the Viterbi algorithm which solves the
segmentation problem from Sproat et al. (1996).

5.8 Now imagine a version of English that was written without spaces.
Apply your segmentation program to this ‘compressed English’. You will
need other programs to compute word bigrams or trigrams.

5.9 Two words are confusable if they have phonetically similar
pronunciations. Use one of your dynamic programming implementations to take two
words and output a simple measure of how confusable they are. You will
need to use an on-line pronunciation dictionary. You will also need a metric
for how close together two phones are. Use your favorite set of phonetic
feature vectors for this. You may assume some small constant probability of
phone insertion and deletion.


But it must be recognized that the notion ‘probability of a
sentence’ is an entirely useless one, under any known interpretation
of this term.

Noam Chomsky (1969, p. 57)

Anytime a linguist leaves the group the recognition rate goes up.
Fred Jelinek (then of the IBM speech group) (1988)¹


Imagine  listening  to  someone  as  they  speak  and  trying  to  guess  the  next 
word  that  they  are  going  to  say.  For  example  what  word  is  likely  to  follow 
this  sentence  fragment?: 

I’d  like  to  make  a collect. . . 

Probably the most likely word is call, although it’s possible the next
word could be telephone, or person-to-person or international. (Think of
some others.) Guessing the next word (or word prediction) is an
essential subtask of speech recognition, hand-writing recognition, augmentative
communication for the disabled, and spelling error detection. In such tasks,
word-identification is difficult because the input is very noisy and
ambiguous. Thus looking at previous words can give us an important cue about
what the next ones are going to be. Russell and Norvig (1995) give an
example from Take the Money and Run, in which a bank teller interprets Woody
Allen’s sloppily written hold-up note as saying “I have a gub”. A speech
recognition system (and a person) can avoid this problem by their
knowledge of word sequences (“a gub” isn’t an English word sequence) and of
their probabilities (especially in the context of a hold-up, “I have a gun” will
have a much higher probability than “I have a gub” or even “I have a gull”).

¹ In an address to the first Workshop on the Evaluation of Natural Language Processing
Systems, December 7, 1988. While this workshop is described in Palmer and Finin (1990),
the quote was not written down; some participants remember a more snappy version: Every
time I fire a linguist the performance of the recognizer improves.

This ability to predict the next word is important for augmentative
communication systems (Newell et al., 1998). These are computer
systems that help the disabled in communication. For example, people who
are unable to use speech or sign-language to communicate, like the
physicist Stephen Hawking, use systems that speak for them, letting them choose
words with simple hand movements, either by spelling them out, or by
selecting from a menu of possible words. But spelling is very slow, and a menu
of words obviously can’t have all possible English words on one screen.
Thus it is important to be able to know which words the speaker is likely to
want to use next, so as to put those on the menu.

Finally,  consider  the  problem  of  detecting  real-word  spelling  errors. 
These  are  spelling  errors  that  result  in  real  words  of  English  (although  not 
the  ones  the  writer  intended)  and  so  detecting  them  is  difficult  (we  can’t  find 
them  by  just  looking  for  words  that  aren’t  in  the  dictionary).  Figure  6.1  gives 
some  examples. 

They  are  leaving  in  about  fifteen  minuets  to  go  to  her  house. 

The  study  was  conducted  mainly  be  John  Black. 

The  design  an  construction  of  the  system  will  take  more  than  a year. 

Hopefully,  all  with  continue  smoothly  in  my  absence. 

Can  they  lave  him  my  messages? 

I need  to  notified  the  bank  of  [this  problem.] 

He  is  trying  to  fine  out. 

Figure  6.1  Some  attested  real-word  spelling  errors  from  Kukich  (1992). 


These errors can be detected by algorithms which examine, among
other features, the words surrounding the errors. For example, while the
phrase in about fifteen minuets is perfectly grammatical English, it is a very
unlikely combination of words. Spellcheckers can look for low probability
combinations like this. In the examples above the probability of three word
combinations (they lave him, to fine out, to notified the) is very low. Of
course sentences with no spelling errors may also have low probability word
sequences, which makes the task challenging. We will see in Section 6.6 that
there are a number of different machine learning algorithms which make use
of the surrounding words and other features to do context-sensitive spelling
error correction.

Guessing the next word turns out to be closely related to another
problem: computing the probability of a sequence of words. For example the
following sequence of words has a non-zero probability of being
encountered in a text written in English:

...  all  of  a sudden  I notice  three  guys  standing  on  the  sidewalk 

taking  a very  good  long  gander  at  me. 

while  this  same  set  of  words  in  a different  order  probably  has  a very  low 
probability: 

good  all  I of  notice  a taking  sidewalk  the  me  long  three  at  sudden 

guys  gander  on  standing  a a the  very 

Algorithms that assign a probability to a sentence can also be used to
assign a probability to the next word in an incomplete sentence, and vice
versa. We will see in later chapters that knowing the probability of whole
sentences or strings of words is useful in part-of-speech-tagging (Chapter 8),
word-sense disambiguation, and probabilistic parsing (Chapter 12).

In speech recognition, it is traditional to use the term language model
or LM for a statistical model of word sequences. In the rest of this chapter
we will be using both language model and grammar, depending on the
context.

6.1 Counting Words in Corpora

[upon  being  asked  if  there  weren't  enough  words  in  the  English  language  for  him]: 

“Yes,  there  are  enough,  but  they  aren’t  the  right  ones.” 

James  Joyce,  reported  in  Bates  (1997) 

Probabilities are based on counting things. Before we talk about
probabilities, we need to decide what we are going to count and where we are
going to find the things to count.

As we saw in Chapter 5, statistical processing of natural language is
based on corpora (singular corpus), on-line collections of text and speech.
For computing word probabilities, we will be counting words in a training
corpus. Let’s look at part of the Brown Corpus, a 1 million word collection
of samples from 500 written texts from different genres (newspaper,
novels, non-fiction, academic, etc.), which was assembled at Brown University
in 1963-64 (Kucera and Francis, 1967; Francis, 1979; Francis and Kucera,
1982). It contains sentence (6.1); how many words are in this sentence?

(6.1) He stepped out into the hall, was delighted to encounter a water
brother.

Example 6.1 has 13 words if we don’t count punctuation marks as
words, 15 if we count punctuation. Whether we treat period (‘.’), comma
(‘,’), and so on as words depends on the task. There are tasks such as
grammar-checking, spelling error detection, or author-identification, for which
the location of the punctuation is important (for checking for proper
capitalization at the beginning of sentences, or looking for interesting patterns of
punctuation usage that uniquely identify an author). In natural language
processing applications, question-marks are an important cue that someone has
asked a question. Punctuation is a useful cue for part-of-speech tagging.
These applications, then, often count punctuation as words.
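The two counts for sentence (6.1) can be checked with a small tokenizer. The following is only a sketch: the function name tokenize and the regular expression are our own illustrative choices, not a standard tokenization scheme, and the pattern handles only straight apostrophes and hyphens inside words.

```python
import re

def tokenize(text, keep_punctuation=True):
    """Split text into tokens; optionally keep each punctuation mark as a token."""
    # A word is a run of word characters, optionally joined by internal
    # hyphens or apostrophes; a punctuation token is any single character
    # that is neither a word character nor whitespace.
    word = r"\w+(?:[-']\w+)*"
    pattern = word + r"|[^\w\s]" if keep_punctuation else word
    return re.findall(pattern, text)

sentence = ("He stepped out into the hall, was delighted to encounter "
            "a water brother.")
words_only = tokenize(sentence, keep_punctuation=False)
with_punct = tokenize(sentence, keep_punctuation=True)
print(len(words_only), len(with_punct))
```

Whether to pass keep_punctuation=True is exactly the task-dependent decision discussed above.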

Unlike text corpora, corpora of spoken language usually don’t have
punctuation, but speech corpora do have other phenomena that we might or
might not want to treat as words. One speech corpus, the Switchboard corpus
of telephone conversations between strangers, was collected in the early
1990s and contains 2430 conversations averaging 6 minutes each, for a total
of 240 hours of speech and 3 million words (Godfrey et al., 1992). Here’s
a sample utterance of Switchboard (since the units of spoken language are
different than written language, we will use the word utterance rather than
‘sentence’ when we are referring to spoken language):

(6.2)  I do  uh  main-  mainly  business  data  processing 

This utterance, like many or most utterances in spoken language, has
fragments, words that are broken off in the middle, like the first instance
of the word mainly, represented here as main-. It also has filled pauses like
uh, which doesn’t occur in written English. Should we consider these to be
words? Again, it depends on the application. If we are building an automatic
dictation system based on automatic speech recognition, we might want to
strip out the fragments. But the uhs and ums are in fact much more like
words. For example, Smith and Clark (1993) and Clark (1994) have shown
that um has a slightly different meaning than uh (generally speaking um is
used when speakers are having major planning problems in producing an
utterance, while uh is used when they know what they want to say, but are
searching for the exact words to express it). Stolcke and Shriberg (1996b)
also found that uh can be a useful cue in predicting the next word (why might
this be?), and so most speech recognition systems treat uh as a word.

Are capitalized tokens like They and uncapitalized tokens like they the
same word? For most statistical applications these are lumped together,
although sometimes (for example for spelling error correction or
part-of-speech tagging) the capitalization is retained as a separate feature. For the
rest of this chapter we will assume our models are not case-sensitive.

How should we deal with inflected forms like cats versus cat? Again,
this depends on the application. Most current N-gram based systems are
based on the wordform, which is the inflected form as it appears in the
corpus. Thus these are treated as two separate words. This is not a good
simplification for many domains, which might want to treat cats and cat as
instances of a single abstract word, or lemma. A lemma is a set of lexical
forms having the same stem, the same major part of speech, and the same
word-sense. We will return to the distinction between wordforms (which
distinguish cat and cats) and lemmas (which lump cat and cats together) in
Chapter 16.

How many words are there in English? One way to answer this
question is to count in a corpus. We use types to mean the number of distinct
words in a corpus, i.e. the size of the vocabulary, and tokens to mean the
total number of running words. Thus the following sentence from the Brown
corpus has 16 word tokens and 14 word types (not counting punctuation):

(6.3) They picnicked by the pool, then lay back on the grass and looked at
the stars.
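The type and token counts for sentence (6.3) can be reproduced with a few lines of code. This is a minimal sketch under our own assumptions: the helper name count_types_and_tokens is hypothetical, punctuation is discarded, and case is folded (in line with the case-insensitive models assumed in this chapter).

```python
import re

def count_types_and_tokens(text):
    """Return (number of tokens, number of types), ignoring punctuation and case."""
    tokens = re.findall(r"[a-z']+", text.lower())  # running words
    return len(tokens), len(set(tokens))           # tokens, distinct wordforms

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
n_tokens, n_types = count_types_and_tokens(sentence)
print(n_tokens, n_types)
```

The two-word gap between the counts comes entirely from the three occurrences of the, which contribute three tokens but a single type.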

The Switchboard corpus has 2.4 million wordform tokens and
approximately 20,000 wordform types. This includes proper nouns. Spoken
language is less rich in its vocabulary than written language: Kucera (1992)
gives a count for Shakespeare’s complete works at 884,647 wordform tokens
from 29,066 wordform types. Thus each of the 884,647 wordform tokens is
a repetition of one of the 29,066 wordform types. The 1 million wordform
tokens of the Brown corpus contain 61,805 wordform types that belong to
37,851 lemma types. All these corpora are quite small. Brown et al. (1992)
amassed a corpus of 583 million wordform tokens of English that included
293,181 different wordform types.

Dictionaries are another way to get an estimate of the number of words,
although since dictionaries generally do not include inflected forms they are
better at measuring lemmas than wordforms. The American Heritage 3rd
edition dictionary has 200,000 “boldface forms”; this is somewhat higher
than the true number of lemmas, since there can be one or more boldface
forms per lemma (and since the boldface forms include multiword phrases).

The rest of this chapter will continue to distinguish between types and
tokens. ‘Types’ will mean wordform types and not lemma types, and
punctuation marks will generally be counted as words.


6.2 Simple (Unsmoothed) N-grams

The models of word sequences we will consider in this chapter are
probabilistic models: ways to assign probabilities to strings of words, whether for
computing the probability of an entire sentence or for giving a probabilistic
prediction of what the next word will be in a sequence. As we did in
Chapter 5, we will assume that the reader has a basic knowledge of probability
theory.

The simplest possible model of word sequences would simply let any
word of the language follow any other word. In the probabilistic version of
this theory, then, every word would have an equal probability of following
every other word. If English had 100,000 words, the probability of any word
following any other word would be 1/100,000 or .00001.

In a slightly more complex model of word sequences, any word could
follow any other word, but the following word would appear with its
normal frequency of occurrence. For example, the word the has a high relative
frequency: it occurs 69,971 times in the Brown corpus of 1,000,000 words
(i.e. 7% of the words in this particular corpus are the). By contrast the word
rabbit occurs only 11 times in the Brown corpus.
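The figures just quoted can be turned into probabilities with one division each. The sketch below uses only the counts given in the text; the names BROWN_TOKENS, COUNTS, and relative_frequency are illustrative.

```python
BROWN_TOKENS = 1_000_000                  # corpus size quoted in the text
COUNTS = {"the": 69_971, "rabbit": 11}    # occurrence counts quoted in the text

def relative_frequency(word):
    """Count of the word divided by the total number of tokens."""
    return COUNTS[word] / BROWN_TOKENS

# The equal-probability model, assuming a 100,000-word vocabulary.
uniform = 1 / 100_000

print(round(relative_frequency("the"), 2))   # share of tokens that are "the"
print(relative_frequency("rabbit"), uniform)
```

Note that rabbit comes out at roughly the same probability as a word under the equal-probability model, while the is several thousand times more likely.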

We can use these relative frequencies to assign a probability
distribution across following words. So if we’ve just seen the string Anyhow, we can
use the probability .07 for the and .00001 for rabbit to guess the next word.
But suppose we’ve just seen the following string:

Just  then,  the  white 

In this context rabbit seems like a more reasonable word to follow
white than the does. This suggests that instead of just looking at the
individual relative frequencies of words, we should look at the conditional
probability of a word given the previous words. That is, the probability
of seeing rabbit given that we just saw white (which we will represent as
P(rabbit|white)) is higher than the probability of rabbit otherwise.

Given this intuition, let’s look at how to compute the probability of a
complete string of words (which we can represent either as $w_1 \ldots w_n$ or $w_1^n$).
If we consider each word occurring in its correct location as an independent
event, we might represent this probability as follows:

$P(w_1, w_2, \ldots, w_{n-1}, w_n)$  (6.4)

We  can  use  the  chain  rule  of  probability  to  decompose  this  probability: 

$P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) \ldots P(w_n | w_1^{n-1})$

$\qquad = \prod_{k=1}^{n} P(w_k | w_1^{k-1})$  (6.5)
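As a small worked instance of the chain rule (an illustration of ours, using the three-word string the white rabbit rather than anything estimated from a corpus), the decomposition for n = 3 reads:

```latex
P(\text{the white rabbit})
  = P(\text{the}) \, P(\text{white} \mid \text{the}) \, P(\text{rabbit} \mid \text{the white})
```

Each factor conditions on the entire preceding history, so the histories grow with the length of the string.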

But how can we compute probabilities like $P(w_n | w_1^{n-1})$? We don’t
know any easy way to compute the probability of a word given a long
sequence of preceding words. (For example, we can’t just count the number of
times every word occurs following every long string; we would need far too
large a corpus.)

We solve this problem by making a useful simplification: we
approximate the probability of a word given all the previous words. The
approximation we will use is very simple: the probability of the word given the single
previous word! The bigram model approximates the probability of a word
given all the previous words $P(w_n | w_1^{n-1})$ by the conditional probability
given only the preceding word $P(w_n | w_{n-1})$. In other words, instead of computing the
probability

P(rabbit | Just the other day I saw a)  (6.6)

we approximate it with the probability

P(rabbit | a)  (6.7)

This  assumption  that  the  probability  of  a word  depends  only  on  the 
previous  word  is  called  a Markov  assumption.  Markov  models  are  the  class 
of probabilistic models that assume that we can predict the probability of
some future unit without looking too far into the past. We saw this use of
the  word  Markov  in  introducing  the  Markov  chain  in  Chapter  5.  Recall  that 
a Markov  chain  is  a kind  of  weighted  finite-state  automaton;  the  intuition  of 
the  term  Markov  in  Markov  chain  is  that  the  next  state  of  a weighted  FSA  is 
always  dependent  on  a finite  history  (since  the  number  of  states  in  a finite- 
state  automaton  is  finite).  The  simple  bigram  model  can  be  viewed  as  a 
simple  kind  of  Markov  chain  which  has  one  state  for  each  word. 

We can generalize the bigram (which looks one word into the past) to
the trigram (which looks two words into the past) and thus to the N-gram
(which looks N - 1 words into the past). A bigram is called a first-order
Markov model (because it looks one token into the past), a trigram is a
second-order Markov model, and in general an N-gram is an (N - 1)th-order
Markov model. Markov models of words were common in engineering,
psychology, and linguistics until Chomsky’s influential review of Skinner’s
Verbal Behavior in 1959 (see the History section at the back of the chapter),
but then went out of vogue until the success of N-gram models in the IBM speech
recognition laboratory at the Thomas J. Watson Research Center brought
them back to the attention of the community.

The general equation for this N-gram approximation to the conditional
probability of the next word in a sequence is:

$P(w_n | w_1^{n-1}) \approx P(w_n | w_{n-N+1}^{n-1})$  (6.8)

Equation 6.8 shows that the probability of a word $w_n$ given all the
previous words can be approximated by the probability given only the previous
N - 1 words.

For  a bigram  grammar,  then,  we  compute  the  probability  of  a complete 
string  by  substituting  equation  6.8  into  equation  6.5.  The  result: 

$P(w_1^n) \approx \prod_{k=1}^{n} P(w_k | w_{k-1})$  (6.9)

Let’s look at an example from a speech-understanding system. The
Berkeley Restaurant Project is a speech-based restaurant consultant; users
ask questions about restaurants in Berkeley, California, and the system
displays appropriate information from a database of local restaurants (Jurafsky
et al., 1994). Here are some sample user queries:

I'm  looking  for  Cantonese  food. 

I'd  like  to  eat  dinner  someplace  nearby. 

Tell  me  about  Chez  Panisse. 

Can you give me a listing of the kinds of food that are available?

I'm  looking  for  a good  place  to  eat  breakfast. 

I definitely  do  not  want  to  have  cheap  Chinese  food. 

When  is  Caffe  Venezia  open  during  the  day? 

I don’t  wanna  walk  more  than  ten  minutes. 

Figure 6.2 shows a sample of the bigram probabilities for some of the
words that can follow the word eat, taken from actual sentences spoken by
users (putting off just for now the algorithm for training bigram
probabilities). Note that these probabilities encode some facts that we think of as
strictly syntactic in nature (like the fact that what comes after eat is usually
something that begins a noun phrase, i.e. an adjective, quantifier or noun),
as well as facts that we think of as more culturally based (like the low
probability of anyone asking for advice on finding British food).


eat on      .16      eat Thai       .03
eat some    .06      eat breakfast  .03
eat lunch   .06      eat in         .02
eat dinner  .05      eat Chinese    .02
eat at      .04      eat Mexican    .02
eat a       .04      eat tomorrow   .01
eat Indian  .04      eat dessert    .007
eat today   .03      eat British    .001

Figure 6.2  A fragment of a bigram grammar from the Berkeley Restaurant
Project showing the most likely words to follow eat.

Assume that in addition to the probabilities in Figure 6.2, our grammar
also includes the bigram probabilities in Figure 6.3 (<s> is a special word
meaning ‘Start of sentence’).

<s> I     .25    I want   .32    want to    .65    to eat    .26    British food        .60
<s> I’d   .06    I would  .29    want a     .05    to have   .14    British restaurant  .15
<s> Tell  .04    I don’t  .08    want some  .04    to spend  .09    British cuisine     .01
<s> I’m   .02    I have   .04    want Thai  .01    to be     .02    British lunch       .01

Figure 6.3  More fragments from the bigram grammar from the Berkeley
Restaurant Project.


Now we can compute the probability of sentences like I want to eat
British food or I want to eat Chinese food by simply multiplying the
appropriate bigram probabilities together, as follows:

P(I want to eat British food) = P(I|<s>) P(want|I) P(to|want) P(eat|to)
                                P(British|eat) P(food|British)
                              = .25 * .32 * .65 * .26 * .002 * .60
                              = .000016
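The multiplication above is easy to mechanize. The following sketch is illustrative only: the dictionary BIGRAM_P and the function sentence_probability are names of our own, and the six probabilities are copied from the factors of the worked computation above rather than from a trained model.

```python
# Bigram probabilities copied from the worked computation above.
# Keys are (previous word, word) pairs; "<s>" marks the start of the sentence.
BIGRAM_P = {
    ("<s>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.002,
    ("British", "food"): 0.60,
}

def sentence_probability(words, bigram_p):
    """Multiply P(w_k | w_{k-1}) over the sentence, starting from <s>."""
    prob = 1.0
    previous = "<s>"
    for word in words:
        # In this unsmoothed sketch an unseen bigram contributes probability 0.
        prob *= bigram_p.get((previous, word), 0.0)
        previous = word
    return prob

p = sentence_probability("I want to eat British food".split(), BIGRAM_P)
print(f"{p:.6f}")
```

Because unseen bigrams are given probability zero here, any sentence containing a bigram missing from the table gets probability zero overall; this is the problem that the smoothing algorithms of Section 6.3 address.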

As we can see, since probabilities are all less than or equal to 1 (by
definition), the product of many probabilities gets smaller the more probabilities
we multiply. This causes a practical problem: the risk of numerical underflow. If we
are computing the probability of a very long string (like a paragraph or an
entire document) it is more customary to do the computation in log space; we
take the log of each probability (the logprob), add all the logs (since adding
in log space is equivalent to multiplying in linear space) and then take the
anti-log of the result. For this reason many standard programs for computing
N-grams actually store and calculate all probabilities as logprobs. In this text
we will always report logs in base 2 (i.e. we will use log to mean log2).
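The following sketch illustrates both the underflow problem and the log-space remedy. The six factors are those of the worked bigram example earlier in this section; the 1000-factor string with probability .00001 per word is a made-up stress test, and the function name logprob_sum is our own.

```python
import math

def logprob_sum(probabilities):
    """Add base-2 log probabilities instead of multiplying raw probabilities."""
    return sum(math.log2(p) for p in probabilities)

# Factors of P(I want to eat British food) from the worked example.
factors = [0.25, 0.32, 0.65, 0.26, 0.002, 0.60]
log_p = logprob_sum(factors)
print(log_p, 2 ** log_p)          # the anti-log recovers the plain product

# Hypothetical long string: 1000 words, each with probability .00001.
long_string = [1e-5] * 1000
product = math.prod(long_string)  # too small for a float: underflows to 0.0
log_total = logprob_sum(long_string)
print(product, log_total)
```

The raw product of the long string is indistinguishable from zero in floating point, whereas its logprob is an ordinary negative number that can still be compared with the logprobs of other strings.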

A trigram model looks just the same as a bigram model, except that
we condition on the two previous words (e.g. we use P(food|eat British)
instead of P(food|British)). To compute trigram probabilities at the very
beginning of a sentence, we can use two pseudo-words for the first trigram
(i.e. P(I|<start1><start2>)).
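The padding with two pseudo-words can be sketched as follows. The function name trigrams_with_start is hypothetical, and the pseudo-word spellings simply follow the <start1> and <start2> used above.

```python
def trigrams_with_start(words, start_symbols=("<start1>", "<start2>")):
    """Return (history, word) pairs, where history is the two preceding tokens."""
    padded = list(start_symbols) + list(words)
    return [((padded[i - 2], padded[i - 1]), padded[i])
            for i in range(2, len(padded))]

pairs = trigrams_with_start("I want to eat British food".split())
for history, word in pairs:
    print(f"P({word} | {' '.join(history)})")
```

With the padding in place every word of the sentence, including the first, has a two-token history, so the same trigram lookup can be used at every position.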

N-gram models can be trained by counting and normalizing (for
probabilistic models, normalizing means dividing by some total count so that the
resulting probabilities fall legally between 0 and 1). We take some training
corpus, and from this corpus take the count of a particular bigram, and divide
this count by the sum of all the bigrams that share the same first word:


$P(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}$  (6.10)


We can simplify this equation, since the sum of all bigram counts that
start with a given word $w_{n-1}$ must be equal to the unigram count for that
word $w_{n-1}$. (The reader should take a moment to be convinced of this.)


$P(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$  (6.11)

For the general case of N-gram parameter estimation:

$P(w_n | w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$  (6.12)


Equation 6.12 estimates the N-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a
prefix. This ratio is called a relative frequency; the use of relative
frequencies as a way to estimate probabilities is one example of the technique
known as Maximum Likelihood Estimation or MLE, because the resulting
parameter set is one in which the likelihood of the training set T given the
model M (i.e. P(T|M)) is maximized. For example, suppose the word
Chinese occurs 400 times in a corpus of a million words like the Brown corpus.
What is the probability that it will occur in some other text of, say, a million
words? The MLE estimate of its probability is 400/1,000,000 or .0004. Now .0004
is not the best possible estimate of the probability of Chinese occurring in all
situations; but it is the probability that makes it most likely that Chinese will
occur 400 times in a million-word corpus.
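Counting and normalizing can be sketched in a few lines. Everything here is illustrative rather than a reference implementation: the three-sentence corpus is invented, the function name train_ngram_mle is our own, and the history is padded with <s> symbols as an assumption about how sentence starts are handled. The denominator counts how often a history is followed by any word, i.e. the sum in Equation 6.10.

```python
from collections import Counter

def train_ngram_mle(sentences, n):
    """Estimate P(word | previous n-1 words) by relative frequency."""
    ngram_counts = Counter()     # C(history + word)
    history_counts = Counter()   # sum over w of C(history + w)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.split()
        for i in range(n - 1, len(tokens)):
            history = tuple(tokens[i - n + 1:i])
            ngram_counts[history + (tokens[i],)] += 1
            history_counts[history] += 1
    return {ngram: count / history_counts[ngram[:-1]]
            for ngram, count in ngram_counts.items()}

toy_corpus = [  # hypothetical corpus, for illustration only
    "I want to eat Chinese food",
    "I want to eat lunch",
    "I want Chinese food",
]
bigram_p = train_ngram_mle(toy_corpus, n=2)
print(bigram_p[("want", "to")], bigram_p[("want", "Chinese")])
```

In the toy corpus want is followed by to twice and by Chinese once, so the two estimates are 2/3 and 1/3 and sum to one, as normalization requires.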

There are better methods of estimating N-gram probabilities than using
relative frequencies (we will consider a class of important algorithms in
Section 6.3), but even the more sophisticated algorithms make use in some way
of this idea of relative frequency. Figure 6.4 shows the bigram counts from a
piece of a bigram grammar from the Berkeley Restaurant Project. Note that
the majority of the values are zero. In fact we have chosen the sample words
to cohere with each other; a matrix selected from a random set of 7 words
would be even more sparse.


          I     want   to    eat   Chinese  food  lunch
I         8     1087   0     13    0        0     0
want      3     0      786   0     6        8     6
to        3     0      10    860   3        0     12
eat       0     0      2     0     19       2     52
Chinese   2     0      0     0     0        120   1
food      19    0      17    0     0        0     0
lunch     4     0      0     0     0        1     0

Figure 6.4  Bigram counts for 7 of the words (out of 1616 total word types)
in the Berkeley Restaurant Project corpus of ≈10,000 sentences.

Figure 6.5 shows the bigram probabilities after normalization (dividing
each row by the following appropriate unigram counts):

I        3437
want     1215
to       3256
eat      938
Chinese  213
food     1506
lunch    459
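The normalization step can be reproduced directly from the numbers given. In the sketch below the bigram counts are those of Figure 6.4 and the unigram counts are those listed above; the names WORDS, BIGRAM_COUNTS, UNIGRAM_COUNTS, and normalize_rows are our own.

```python
WORDS = ["I", "want", "to", "eat", "Chinese", "food", "lunch"]

# Rows: first word of the bigram; columns: second word (Figure 6.4).
BIGRAM_COUNTS = [
    [8, 1087, 0, 13, 0, 0, 0],
    [3, 0, 786, 0, 6, 8, 6],
    [3, 0, 10, 860, 3, 0, 12],
    [0, 0, 2, 0, 19, 2, 52],
    [2, 0, 0, 0, 0, 120, 1],
    [19, 0, 17, 0, 0, 0, 0],
    [4, 0, 0, 0, 0, 1, 0],
]
UNIGRAM_COUNTS = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "Chinese": 213, "food": 1506, "lunch": 459}

def normalize_rows(words, bigram_counts, unigram_counts):
    """Divide each bigram count by the unigram count of its first word."""
    probs = {}
    for first, row in zip(words, bigram_counts):
        for second, count in zip(words, row):
            probs[(first, second)] = count / unigram_counts[first]
    return probs

probs = normalize_rows(WORDS, BIGRAM_COUNTS, UNIGRAM_COUNTS)
print(f"{probs[('I', 'want')]:.2f} {probs[('want', 'to')]:.2f} "
      f"{probs[('to', 'eat')]:.2f}")
```

The three probabilities printed are the P(want|I), P(to|want), and P(eat|to) factors used in the worked sentence probability earlier in the section.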

          I       want  to     eat    Chinese  food   lunch
I         .0023   .32   0      .0038  0        0      0
want      .0025   0     .65    0      .0049    .0066  .0049
to        .00092  0     .0031  .26    .00092   0      .0037
eat       0       0     .0021  0      .020     .0021  .055
Chinese   .0094   0     0      0      0        .56    .0047
food      .013    0     .011   0      0        0      0
lunch     .0087   0     0      0      0        .0022  0

Figure 6.5  Bigram probabilities for 7 of the words (out of 1616 total word
types) in the Berkeley Restaurant Project corpus of ≈10,000 sentences.

More on N-grams and their sensitivity to the training corpus

In this section we look at a few examples of different N-gram models to
get an intuition for two important facts about their behavior. The first is the
increasing accuracy of N-gram models as we increase the value of N. The
second is their very strong dependency on their training corpus (in particular
its genre and its size in words).

We do this by borrowing a visualization technique proposed by
Shannon (1951) and also used by Miller and Selfridge (1950). The idea is to train
various N-grams and then use each to generate random sentences. It’s
simplest to visualize how this works for the unigram case. Imagine all the words
of English covering the probability space between 0 and 1. We choose a
random number between 0 and 1, and print out the word that covers the real
value we have chosen. The same technique can be used to generate higher
order N-grams by first generating a random bigram that starts with <s>
(according to its bigram probability), then choosing a random bigram to follow
it (again, where the likelihood of following a particular bigram is
proportional to its conditional probability), and so on.
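The generation procedure just described can be sketched for the bigram case. This is a minimal illustration under our own assumptions: the three training sentences are invented, the end-of-sentence symbol </s> is an addition of ours (the text only introduces <s>), and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def train_bigram_counts(sentences):
    """Count bigrams, with <s> and </s> marking the sentence boundaries."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for previous, word in zip(tokens, tokens[1:]):
            followers[previous][word] += 1
    return followers

def sample_word(counter, rng):
    """Pick a word with probability proportional to its count.

    The words tile the interval from 0 to the total count; we draw a
    random point in that interval and return the word that covers it.
    """
    threshold = rng.random() * sum(counter.values())
    cumulative = 0
    for word, count in counter.items():
        cumulative += count
        if threshold < cumulative:
            return word
    return word  # reached only through floating-point rounding

def generate(followers, rng, max_words=20):
    """Generate one sentence by repeatedly sampling the next word."""
    words, previous = [], "<s>"
    for _ in range(max_words):
        word = sample_word(followers[previous], rng)
        if word == "</s>":
            break
        words.append(word)
        previous = word
    return words

toy_corpus = [  # hypothetical training sentences
    "I want to eat Chinese food",
    "I want to eat lunch",
    "Tell me about Chinese food",
]
followers = train_bigram_counts(toy_corpus)
rng = random.Random(0)  # fixed seed so that runs are repeatable
print(" ".join(generate(followers, rng)))
```

Every adjacent pair of words in a generated sentence is a bigram seen in training, but the sentence as a whole need not be: Tell me about Chinese food and I want to eat Chinese food can recombine in ways the corpus never contained.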

To give an intuition for the increasing power of higher-order N-grams,
we trained a unigram, bigram, trigram, and a quadrigram model on the
complete corpus of Shakespeare’s works. We then used these four grammars to
generate random sentences. In the following examples we treated each
punctuation mark as if it were a word in its own right, and we trained the
grammars on a version of the corpus with all capital letters changed to lowercase.
After generating the sentences we corrected the output for capitalization just
to improve readability. Some of the resulting sentences:

1. Unigram approximation to Shakespeare

(a)  To  him  swallowed  confess  hear  both.  Which.  Of  save  on  trail  for  are  ay 
device  and  rote  life  have 

(b)  Every  enter  now  severally  so,  let 

(c) Hill he late speaks; or! a more to leg less first you enter

(d) Will rash been and by I the me loves gentle me not slavish page, the and
hour; ill let

(e)  Are  where  exeunt  and  sighs  have  rise  excellency  took  of..  Sleep  knave 
we.  near;  vile  like 

2.  Bigram  approximation  to  Shakespeare 

(a)  What  means,  sir.  I confess  she?  then  all  sorts,  he  is  trim,  captain. 

(b)  Why  dost  stand  forth  thy  canopy,  forsooth;  he  is  this  palpable  hit  the 
King  Henry.  Live  king.  Follow. 

(c)  What  we,  hath  got  so  she  that  I rest  and  sent  to  scold  and  nature  bankrupt, 
nor  the  first  gentleman? 

(d)  Enter  Menenius,  if  it  so  many  good  direction  found’st  thou  art  a strong 
upon  command  of  fear  not  a liberal  largess  given  away,  Falstaff ! Exeunt 

(e)  Thou  whoreson  chops.  Consumption  catch  your  dearest  friend,  well, 
and  I know  where  many  mouths  upon  my  undoing  all  but  be,  how  soon, 
then;  we’ll  execute  upon  my  love’s  bonds  and  we  do  you  will? 

(f) The world shall- my lord!

3.  Trigram  approximation  to  Shakespeare 

(a)  Sweet  prince,  Falstaff  shall  die.  Harry  of  Monmouth’s  grave. 

(b)  This  shall  forbid  it  should  be  branded,  if  renown  made  it  empty. 

(c)  What  is’t  that  cried? 

(d)  Indeed  the  duke;  and  had  a very  good  friend. 

(e) Fly, and will rid me these news of price. Therefore the sadness of
parting, as they say, ’tis done.

(f)  The  sweet!  How  many  then  shall  posthumus  end  his  miseries. 

4.  Quadrigram  approximation  to  Shakespeare 

(a)  King  Henry.  What!  I will  go  seek  the  traitor  Gloucester.  Exeunt  some 
of  the  watch.  A great  banquet  serv’d  in; 

(b)  Will  you  not  tell  me  who  I am? 

(c)  It  cannot  be  but  so. 

(d)  Indeed  the  short  and  the  long.  Marry,  ’tis  a noble  Lepidus. 

(e)  They  say  all  lovers  swear  more  performance  than  they  are  wont  to  keep 
obliged  faith  unforfeited ! 

(f)  Enter  Leonato’s  brother  Antonio,  and  the  rest,  but  seek  the  weary  beds 
of  people  sick. 

The longer the context on which we train the model, the more
coherent the sentences. In the unigram sentences, there is no coherent relation
between words, and in fact none of the sentences end in a period or other
sentence-final punctuation. The bigram sentences can be seen to have very
local word-to-word coherence (especially if we consider that punctuation
counts as a word). The trigram and quadrigram sentences are beginning to
look a lot like Shakespeare. Indeed a careful investigation of the
quadrigram sentences shows that they look a little too much like Shakespeare. The
words It cannot be but so are directly from King John. This is because
the Shakespeare oeuvre, while large by many standards, is somewhat less
than a million words. Recall that Kucera (1992) gives a count for
Shakespeare’s complete works at 884,647 words (tokens) from 29,066 wordform
types (including proper nouns). That means that even the bigram model is
very sparse; with 29,066 types, there are 29,066^2, or more than 844 million
possible bigrams, so a 1 million word training set is clearly vastly insufficient
to estimate the frequency of the rarer ones; indeed somewhat under 300,000
different bigram types actually occur in Shakespeare. This is far too small to
train quadrigrams; thus once the generator has chosen the first quadrigram
(It cannot be but), there are only 5 possible continuations (that, I, he, thou,
and so); indeed for many quadrigrams there is only one continuation.

The probabilities in a statistical model like an N-gram come from
the corpus it is trained on. This training corpus needs to be
carefully designed. If the training corpus is too specific to the task or
domain, the probabilities may be too narrow and not generalize well
to new sentences. If the training corpus is too general, the
probabilities may not do a sufficient job of reflecting the task or domain.

Furthermore, suppose we are trying to compute the
probability of a particular ‘test’ sentence. If our ‘test’ sentence is part of
the training corpus, it will have an artificially high probability. The
training corpus must not be biased by including this sentence. Thus
when using a statistical model of language given some corpus of
relevant data, we start by dividing the data into a training set and a test
set. We train the statistical parameters of the model on the training
set, and then use them to compute probabilities on the test set.

This training-and-testing paradigm can also be used to evaluate
different N-gram architectures. For example to compare the different
smoothing algorithms we will introduce in Section 6.3, we can take
a large corpus and divide it into a training set and a test set. Then
we train the two different N-gram models on the training set and
see which one better models the test set. But what does it mean to
‘model the test set’? There is a useful metric for how well a given
statistical model matches a test corpus, called perplexity. Perplexity
is a variant of entropy, and will be introduced on page 221.

In some cases we need more than one test set. For example,
suppose we have a few different possible language models and we
want first to pick the best one and then to see how it does on a fair
test set, i.e. one we’ve never looked at before. We first use a
development test set (also called a devtest set) to pick the best language
model, and perhaps tune some parameters. Then once we come up
with what we think is the best model, we run it on the true test set.

When comparing models it is important to use statistical tests
(introduced in any statistics class or textbook for the social sciences)
to determine if the difference between two models is significant.
Cohen (1995) is a useful reference which focuses on statistical research
methods for artificial intelligence. Dietterich (1998) focuses on
statistical tests for comparing classifiers.
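The division of a corpus into training, devtest, and test sets described above can be sketched as follows. The proportions (80/10/10), the shuffling with a fixed seed, and the function name split_corpus are all assumptions made for illustration; the text does not prescribe particular sizes.

```python
import random

def split_corpus(sentences, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle the sentences and split them into training, devtest, and test sets."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)   # fixed seed: the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, devtest, test

corpus = [f"sentence {i}" for i in range(100)]   # placeholder data
train, devtest, test = split_corpus(corpus)
print(len(train), len(devtest), len(test))
```

The three sets are disjoint by construction, so no test sentence can inflate its own probability by also appearing in the training data.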

To get an idea of the dependence of a grammar on its training set,
let’s look at an N-gram grammar trained on a completely different corpus:
the Wall Street Journal (WSJ). A native speaker of English is capable of
reading both Shakespeare and the Wall Street Journal; both are subsets of
English. Thus it seems intuitive that our N-grams for Shakespeare should
have some overlap with N-grams from the Wall Street Journal. In order to
check whether this is true, here are three sentences generated by unigram,
bigram, and trigram grammars trained on 40 million words of articles from
the daily Wall Street Journal (these grammars are Katz backoff grammars
with Good-Turing smoothing; we will learn in the next section how these are
constructed). Again, we have corrected the output by hand with the proper
English capitalization for readability.

1. (unigram) Months the my and issue of year foreign new exchange’s
September were recession exchange new endorsed a acquire to six
executives

2. (bigram) Last December through the way to preserve the Hudson
corporation N. B. E. C. Taylor would seem to complete the major central
planners one point five percent of U. S. E. has already old M. X.
corporation of living on information such as more frequently fishing to keep
her

3. (trigram) They also point to ninety nine point six billion dollars from
two hundred four oh six three percent of the rates of interest stores as
Mexico and Brazil on market conditions

Compare these examples to the pseudo-Shakespeare on the previous
page; while superficially they both seem to model ‘English-like sentences’
there is obviously no overlap whatsoever in possible sentences, and very
little if any overlap even in small phrases. The difference between the
Shakespeare and WSJ corpora tells us that a good statistical approximation to
English will have to involve a very large corpus with a very large cross-section
of different genres. Even then a simple statistical model like an N-gram
would be incapable of modeling the consistency of style across genres (we
would only want to expect Shakespearean sentences when we are reading
Shakespeare, not in the middle of a Wall Street Journal article).


6.3  Smoothing 


Never  do  I ever  want  to  hear  another  word! 

There isn’t one,

I haven’t heard!

Eliza  Doolittle  in  Alan  Jay  Lerner’s  My  Fair  Lady  lyrics 


words  people 
never  use  — 
could  be 
only  I 
know  them 

Ishikawa  Takuboku  1885-1912 

One major problem with standard N-gram models is that they must
be trained from some corpus, and because any particular training corpus is
finite, some perfectly acceptable English N-grams are bound to be missing
from it. That is, the bigram matrix for any given training corpus is sparse;
it is bound to have a very large number of cases of putative ‘zero probability
bigrams’ that should really have some non-zero probability. Furthermore,
the MLE method also produces poor estimates when the counts are non-zero
but still small.

Some part of this problem is endemic to N-grams; since they can’t
use long-distance context, they always tend to underestimate the probability
of strings that happen not to have occurred nearby in their training corpus.
But there are some techniques we can use to assign a non-zero probability
to these ‘zero probability bigrams’. This task of reevaluating some of the
zero-probability and low-probability N-grams, and assigning them non-zero
values, is called smoothing. In the next few sections we will introduce some
smoothing algorithms and show how they modify the Berkeley Restaurant
bigram probabilities in Figure 6.5.


Add-One  Smoothing 

One  simple  way  to  do  smoothing  might  be  just  to  take  our  matrix  of  bigram 
counts,  before  we  normalize  them  into  probabilities,  and  add  one  to  all  the 
counts. This algorithm is called add-one smoothing. Although this
algorithm does not perform well and is not commonly used, it introduces many
of  the  concepts  that  we  will  see  in  other  smoothing  algorithms,  and  also  gives 
us  a useful  baseline. 

Let’s  first  consider  the  application  of  add-one  smoothing  to  unigram 
probabilities, since that will be simpler. The unsmoothed maximum
likelihood estimate of the unigram probability can be computed by dividing the
count of the word by the total number of word tokens N:

    P(w_x) = \frac{c(w_x)}{\sum_i c(w_i)} = \frac{c(w_x)}{N}


The various smoothing estimates will rely on an adjusted count c*. The
count adjustment for add-one smoothing can then be defined by adding one
to the count and then multiplying by a normalization factor, N/(N+V), where V
is the total number of word types in the language, i.e. the vocabulary size.
Since we are adding 1 to the count for each word type, the total number of
tokens must be increased by the number of types. The adjusted count for
add-one smoothing is then defined as:

    c_i^* = (c_i + 1) \frac{N}{N + V}    (6.13)


and  the  counts  can  be  turned  into  probabilities  p*  by  normalizing  by  N. 

An alternative way to view a smoothing algorithm is as discounting
(lowering) some non-zero counts in order to get the probability mass that
will be assigned to the zero counts. Thus instead of referring to the
discounted counts c*, many papers also define smoothing algorithms in terms
of a discount d_c, the ratio of the discounted counts to the original counts:

    d_c = \frac{c^*}{c}

Alternatively  we  can  compute  the  probability  p*  directly  from  the  counts 
as  follows: 


    p_i^* = \frac{c_i + 1}{N + V}
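The unigram case can be sketched in a few lines of Python. This is only an illustrative sketch: the toy token list, the extra unseen word "lunch", and the function name `add_one_unigram` are assumptions made for the example, not part of the Berkeley Restaurant data.

```python
from collections import Counter

def add_one_unigram(tokens, vocabulary):
    """Return (adjusted counts c*, smoothed probabilities p*) for each word type.

    c*_i = (c_i + 1) * N / (N + V)   (Equation 6.13)
    p*_i = (c_i + 1) / (N + V)
    """
    counts = Counter(tokens)
    n = len(tokens)        # N: total number of word tokens
    v = len(vocabulary)    # V: total number of word types
    adjusted = {w: (counts[w] + 1) * n / (n + v) for w in vocabulary}
    probs = {w: (counts[w] + 1) / (n + v) for w in vocabulary}
    return adjusted, probs

# Hypothetical toy corpus; "lunch" is in the vocabulary but never observed.
tokens = "i want to eat i want chinese food".split()
vocabulary = sorted(set(tokens) | {"lunch"})
adjusted, probs = add_one_unigram(tokens, vocabulary)
```

Note that the adjusted counts still sum to N and the probabilities to 1, so the mass given to the unseen word is taken from the observed words.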


Now  that  we  have  the  intuition  for  the  unigram  case,  let’s  smooth 
our Berkeley Restaurant Project bigram. Figure 6.6 shows the
add-one-smoothed counts for the bigram in Figure 6.4.

Figure 6.6 Add-one smoothed bigram counts for 7 of the words (out of
1616 total word types) in the Berkeley Restaurant Project corpus of
approximately 10,000 sentences.

Figure 6.7 shows the add-one-smoothed probabilities for the bigram in
Figure 6.5. Recall that normal bigram probabilities are computed by
normalizing each row of counts by the unigram count:

    P(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}    (6.14)


For add-one-smoothed bigram counts we need to first augment the
unigram count by the number of total word types in the vocabulary V:

    p^*(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}    (6.15)


We need to add V (= 1616) to each of the unigram counts:

    I         3437 + 1616 = 5053
    want      1215 + 1616 = 2831
    to        3256 + 1616 = 4872
    eat        938 + 1616 = 2554
    Chinese    213 + 1616 = 1829
    food      1506 + 1616 = 3122
    lunch      459 + 1616 = 2075

The result is the smoothed bigram probabilities in Figure 6.7.

Figure 6.7 Add-one smoothed bigram probabilities for 7 of the words (out
of 1616 total word types) in the Berkeley Restaurant Project corpus of
approximately 10,000 sentences.
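As a rough check on Equations 6.14 and 6.15, the following sketch compares the unsmoothed and add-one-smoothed estimates of P(to | want). It uses the unigram counts and V = 1616 listed above, and the unsmoothed count C(want to) = 786 from the restaurant corpus; the function names are illustrative assumptions.

```python
V = 1616  # vocabulary size (number of word types)

# Unigram counts for the seven example words, as listed in the text.
unigram_counts = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "Chinese": 213, "food": 1506, "lunch": 459}

def mle_bigram(c_bigram, c_history):
    """Unsmoothed estimate, Equation 6.14."""
    return c_bigram / c_history

def add_one_bigram(c_bigram, c_history, vocab_size=V):
    """Add-one-smoothed estimate, Equation 6.15."""
    return (c_bigram + 1) / (c_history + vocab_size)

c_want_to = 786  # unsmoothed count C(want to)
p_mle = mle_bigram(c_want_to, unigram_counts["want"])
p_add1 = add_one_bigram(c_want_to, unigram_counts["want"])
p_unseen = add_one_bigram(0, unigram_counts["want"])  # any bigram "want X" never observed
```

Rounded to two places, `p_mle` and `p_add1` come out at .65 and .28, which shows how much mass add-one smoothing removes from a frequent bigram.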


It  is  often  convenient  to  reconstruct  the  count  matrix  so  we  can  see 
how  much  a smoothing  algorithm  has  changed  the  original  counts.  These 
adjusted  counts  can  be  computed  by  Equation  6.13.  Figure  6.8  shows  the 
reconstructed  counts. 


            I      want   to     eat    Chinese  food   lunch
I           6      740    .68    10     .68      .68    .68
want        2      .42    331    .42    3        4      3
to          3      .69    8      594    3        .69    9
eat         .37    .37    1      .37    7.4      1      20
Chinese     .36    .12    .12    .12    .12      15     .24
food        10     .48    9      .48    .48      .48    .48
lunch       1.1    .22    .22    .22    .22      .44    .22

Figure 6.8 Add-one smoothed bigram counts for 7 of the words (out of
1616 total word types) in the Berkeley Restaurant Project Corpus of
approximately 10,000 sentences.


Note that add-one smoothing has made a very big change to the counts.
C(want to) changed from 786 to 331! We can see this in probability space
as well: P(to | want) decreases from .65 in the unsmoothed case to .28 in the
smoothed case.

Looking at the discount d (the ratio between new and old counts) shows
us how strikingly the counts for each prefix-word have been reduced; the
bigrams starting with Chinese were discounted by a factor of 8!


    I         .68
    want      .42
    to        .69
    eat       .37
    Chinese   .12
    food      .48
    lunch     .22
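The reconstructed counts and the per-word scaling can be sketched as below, applying Equation 6.13 with the unigram count of the first word in the role of N. This is an illustrative reading, not the authors' exact procedure: the function names are assumptions, and because the printed figures are given to two digits, values computed this way may differ slightly from the tables.

```python
V = 1616  # vocabulary size (number of word types)

unigram_counts = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "Chinese": 213, "food": 1506, "lunch": 459}

def history_scale(c_history, vocab_size=V):
    """Factor N / (N + V) applied to every bigram count sharing this first word."""
    return c_history / (c_history + vocab_size)

def add_one_adjusted_count(c_bigram, c_history, vocab_size=V):
    """Reconstructed count c* = (c + 1) * N / (N + V), following Equation 6.13."""
    return (c_bigram + 1) * history_scale(c_history, vocab_size)

scales = {w: history_scale(c) for w, c in unigram_counts.items()}
c_star_want_to = add_one_adjusted_count(786, unigram_counts["want"])
c_star_unseen_after_I = add_one_adjusted_count(0, unigram_counts["I"])
```

A rare first word such as Chinese (213 tokens against 1616 types) gets a scale near .12, which is where the factor-of-8 reduction comes from.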

The  sharp  change  in  counts  and  probabilities  occurs  because  too  much 
probability  mass  is  moved  to  all  the  zeros.  The  problem  is  that  we  arbitrarily 
picked  the  value  “1”  to  add  to  each  count.  We  could  avoid  this  problem  by 
adding smaller values to the counts (‘add-one-half’, ‘add-one-thousandth’),
but  we  would  need  to  retrain  this  parameter  for  each  situation. 

In  general  add-one  smoothing  is  a poor  method  of  smoothing.  Gale  and 
Church  (1994)  summarize  a number  of  additional  problems  with  the  add-one 
method;  the  main  problem  is  that  add-one  is  much  worse  at  predicting  the 
actual  probability  for  bigrams  with  zero  counts  than  other  methods  like  the 
Good-Turing  method  we  will  describe  below.  Furthermore,  they  show  that 
variances of the counts produced by the add-one method are actually worse
than  those  from  the  unsmoothed  MLE  method. 

Witten-Bell  Discounting 

A much better smoothing algorithm that is only slightly more complex than
Add-One smoothing we will refer to as Witten-Bell discounting (it is
introduced as Method C in Witten and Bell (1991)). Witten-Bell discounting
is based on a simple but clever intuition about zero-frequency events. Let’s
think of a zero-frequency word or N-gram as one that just hasn’t happened
yet. When it does happen, it will be the first time we see this new N-gram.
So the probability of seeing a zero-frequency N-gram can be modeled by the
probability of seeing an N-gram for the first time. This is a recurring concept
in statistical language processing:

Key Concept #4. Things Seen Once: Use the count of things you’ve
seen once to help estimate the count of things you’ve never seen.


The  idea  that  we  can  estimate  the  probability  of  ‘things  we  never  saw’ 
with help from the count of ‘things we saw once’ will return when we
discuss Good-Turing smoothing later in this chapter, and then once again when
we  discuss  methods  for  tagging  an  unknown  word  with  a part-of-speech  in 
Chapter  8. 

How can we compute the probability of seeing an N-gram for the first
time? By counting the number of times we saw N-grams for the first time in
our training corpus. This is very simple to produce since the count of
‘first-time’ N-grams is just the number of N-gram types we saw in the data (since
we had to see each type for the first time exactly once).

So we estimate the total probability mass of all the zero N-grams with
the number of types divided by the number of tokens plus observed types:

    \sum_{i : c_i = 0} p_i^* = \frac{T}{N + T}    (6.16)


Why  do  we  normalize  by  the  number  of  tokens  plus  types?  We  can 
think  of  our  training  corpus  as  a series  of  events;  one  event  for  each  token 
and  one  event  for  each  new  type.  So  Equation  6.16  gives  the  Maximum 
Likelihood  Estimate  of  the  probability  of  a new  type  event  occurring.  Note 
that  the  number  of  observed  types  T is  different  than  the  ‘total  types’  or 
‘vocabulary size V’ that we used in add-one smoothing: T is the types we
have  already  seen,  while  V is  the  total  number  of  possible  types  we  might 
ever  see. 

Equation 6.16 gives the total ‘probability of unseen N-grams’. We
need to divide this up among all the zero N-grams. We could just choose
to divide it equally. Let Z be the total number of N-grams with count zero
(types; there aren’t any tokens). Each formerly-zero unigram now gets its
equal share of the redistributed probability mass:

    Z = \sum_{i : c_i = 0} 1    (6.17)

    p_i^* = \frac{T}{Z (N + T)}    (6.18)


If the total probability of zero N-grams is computed from Equation 6.16,
the extra probability mass must come from somewhere; we get it by
discounting the probability of all the seen N-grams as follows:

    p_i^* = \frac{c_i}{N + T}  \text{ if } c_i > 0    (6.19)

Alternatively, we can represent the smoothed counts directly as:

    c_i^* = \begin{cases}
              \frac{T}{Z} \frac{N}{N + T}, & \text{if } c_i = 0 \\
              c_i \frac{N}{N + T}, & \text{if } c_i > 0
            \end{cases}    (6.20)
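Equations 6.16 through 6.19 for the unigram case can be sketched as follows. The toy corpus, the two unseen vocabulary items, and the function name are hypothetical; the sketch also assumes the vocabulary contains at least one unseen type (otherwise the reserved mass T/(N+T) has nowhere to go and the probabilities sum to less than 1).

```python
from collections import Counter

def witten_bell_unigram(tokens, vocabulary):
    """Witten-Bell smoothed unigram probabilities over a closed vocabulary."""
    counts = Counter(tokens)
    n = len(tokens)                                    # N: number of tokens
    t = len(counts)                                    # T: number of observed types
    z = sum(1 for w in vocabulary if counts[w] == 0)   # Z: number of unseen types
    probs = {}
    for w in vocabulary:
        if counts[w] > 0:
            probs[w] = counts[w] / (n + t)             # Equation 6.19
        else:
            probs[w] = t / (z * (n + t))               # Equation 6.18
    return probs

# Hypothetical toy data: "lunch" and "dinner" are in the vocabulary but unseen.
tokens = "i want to eat i want chinese food".split()
vocabulary = sorted(set(tokens) | {"lunch", "dinner"})
wb_probs = witten_bell_unigram(tokens, vocabulary)
unseen_mass = wb_probs["lunch"] + wb_probs["dinner"]   # should equal T / (N + T)
```

Here N = 8 and T = 6, so the unseen words share a total mass of 6/14, split equally between them.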


Witten-Bell discounting looks a lot like add-one smoothing for
unigrams. But if we extend the equation to bigrams we will see a big difference.
This is because now our type-counts are conditioned on some history. In
order to compute the probability of a bigram w_{n-1} w_n we haven’t seen, we
use ‘the probability of seeing a new bigram starting with w_{n-1}’. This lets our
estimate of ‘first-time bigrams’ be specific to a word history. Words that tend
to occur in a smaller number of bigrams will supply a lower ‘unseen-bigram’
estimate than words that are more promiscuous.

We represent this fact by conditioning T, the number of bigram types,
and N, the number of bigram tokens, on the previous word w_x, as follows:

    \sum_{i : c(w_x w_i) = 0} p^*(w_i | w_x) = \frac{T(w_x)}{N(w_x) + T(w_x)}    (6.21)


Again, we will need to distribute this probability mass among all the
unseen bigrams. Let Z again be the total number of bigrams with a given first
word that have count zero (types; there aren’t any tokens). Each
formerly-zero bigram now gets its equal share of the redistributed probability mass:

    Z(w_x) = \sum_{i : c(w_x w_i) = 0} 1    (6.22)

    p^*(w_i | w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1}) (N(w_{i-1}) + T(w_{i-1}))}  \text{ if } c(w_{i-1} w_i) = 0    (6.23)


As for the non-zero bigrams, we discount them in the same manner, by
parameterizing T on the history:

    p^*(w_i | w_x) = \frac{c(w_x w_i)}{c(w_x) + T(w_x)}  \text{ if } c(w_x w_i) > 0    (6.24)

Here c(w_x), the count of the history word, plays the role of N(w_x).


To use Equation 6.24 to smooth the restaurant bigram from Figure 6.5,
we will need the number of bigram types T(w) for each of the first words.
Here are those values:

    I          95
    want       76
    to        130
    eat       124
    Chinese    20
    food       82
    lunch      45

In addition we will need the Z values for each of these words. Since
we know how many words we have in the vocabulary (V = 1,616), there are
exactly V possible bigrams that begin with a given word w, so the number of
unseen bigram types with a given prefix is V minus the number of observed
types:

    Z(w) = V - T(w)    (6.25)

Here are those Z values:

    I         1,521
    want      1,540
    to        1,486
    eat       1,492
    Chinese   1,596
    food      1,534
    lunch     1,571

Figure 6.9 shows the discounted restaurant bigram counts.

Figure 6.9 Witten-Bell smoothed bigram counts for 7 of the words (out of
1616 total word types) in the Berkeley Restaurant Project corpus of
approximately 10,000 sentences.

The discount values for the Witten-Bell algorithm are much more
reasonable than for add-one smoothing:

    I         .97
    want      .94
    to        .96
    eat       .88
    Chinese   .91
    food      .94
    lunch     .91
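The bigram version (Equations 6.23 to 6.25) can be sketched with the T(w) values and unigram counts given above. As an assumption of this sketch, the unigram count of the first word stands in for N(w), the number of bigram tokens starting with w; the function names are illustrative.

```python
V = 1616  # vocabulary size (number of word types)

unigram_counts = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "Chinese": 213, "food": 1506, "lunch": 459}
bigram_types = {"I": 95, "want": 76, "to": 130, "eat": 124,
                "Chinese": 20, "food": 82, "lunch": 45}   # T(w)

def unseen_types(history):
    """Z(w) = V - T(w), Equation 6.25."""
    return V - bigram_types[history]

def wb_bigram_prob(c_bigram, history):
    """Witten-Bell smoothed P*(w_i | history) given the raw bigram count."""
    n = unigram_counts[history]   # assumed stand-in for N(history)
    t = bigram_types[history]
    if c_bigram > 0:
        return c_bigram / (n + t)                  # Equation 6.24
    return t / (unseen_types(history) * (n + t))   # Equation 6.23

def wb_discount(history):
    """Ratio of discounted to original count for seen bigrams: N / (N + T)."""
    n = unigram_counts[history]
    return n / (n + bigram_types[history])

p_want_to = wb_bigram_prob(786, "want")      # seen bigram, C(want to) = 786
p_want_unseen = wb_bigram_prob(0, "want")    # any unseen bigram starting with "want"
```

Compared with add-one smoothing, a frequent bigram such as "want to" loses only a few percent of its probability, because only T(w) events rather than V events are added to the denominator.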

It is also possible to use Witten-Bell (or other) discounting in a
different way. In Equation (6.21), we conditioned the smoothed bigram
probabilities on the previous word. That is, we conditioned the number of types
T(w_x) and tokens N(w_x) on the previous word w_x. But we could choose
instead to treat a bigram as if it were a single event, ignoring the fact that
it is composed of two words. Then T would be the number of types of all
bigrams, and N would be the number of tokens of all bigrams that occurred.
Treating the bigrams as a unit in this way, we are essentially discounting, not
the conditional probability P(w_i | w_x), but the joint probability P(w_x w_i). In
this way the probability P(w_x w_i) is treated just like a unigram probability.
This kind of discounting is less commonly used than the ‘conditional’
discounting we walked through above starting with Equation 6.21 (although it
is often used for the Good-Turing discounting algorithm described below).

In Section 6.4 we show that discounting also plays a role in more
sophisticated language models. Witten-Bell discounting is commonly used in
speech recognition systems such as Placeway et al. (1993).

Good-Turing  Discounting 

This section introduces a slightly more complex form of discounting than the
Witten-Bell algorithm called Good-Turing smoothing. This section may be
skipped by readers who are not focusing on discounting algorithms.

The Good-Turing algorithm was first described by Good (1953), who
credits Turing with the original idea; a complete proof is presented in Church
et al. (1991). The basic insight of Good-Turing smoothing is to re-estimate
the amount of probability mass to assign to N-grams with zero or low counts
by looking at the number of N-grams with higher counts. In other words,
we examine N_c, the number of N-grams that occur c times. We refer to the
number of N-grams that occur c times as the frequency of frequency c. So
applying the idea to smoothing the joint probability of bigrams, N_0 is the
number of bigrams b of count 0, N_1 the number of bigrams with count 1, and
so on:

    N_c = \sum_{b : c(b) = c} 1    (6.26)

The  Good-Turing  estimate  gives  a smoothed  count  c*  based  on  the  set 
of  Nc  for  all  c,  as  follows: 

    c^* = (c + 1) \frac{N_{c+1}}{N_c}    (6.27)

For example, the revised count for the bigrams that never occurred (c_0)
is estimated by dividing the number of bigrams that occurred once (the
singleton or ‘hapax legomenon’ bigrams N_1) by the number of bigrams that
never occurred (N_0). Using the count of things we’ve seen once to estimate
the count of things we’ve never seen should remind you of the Witten-Bell
discounting algorithm we saw earlier in this chapter. The Good-Turing
algorithm was first applied to the smoothing of N-gram grammars by Katz,
as cited in Nadas (1984). Figure 6.10 gives an example of the
application of Good-Turing discounting to a bigram grammar computed by Church
and Gale (1991) from 22 million words from the Associated Press (AP)
newswire. The first column shows the count c, i.e. the number of observed
instances of a bigram. The second column shows the number of bigrams that
had this count. Thus 449,721 bigrams have a count of 2. The third column
shows c*, the Good-Turing re-estimation of the count.


    c (MLE)    N_c               c* (GT)
    0          74,671,100,000    0.0000270
    1          2,018,046         0.446
    2          449,721           1.26
    3          188,933           2.24
    4          105,668           3.24
    5          68,379            4.22
    6          48,190            5.19
    7          35,709            6.21
    8          27,710            7.24
    9          22,280            8.25

Figure 6.10 Bigram ‘frequencies of frequencies’ from 22 million AP
bigrams, and Good-Turing re-estimations after Church and Gale (1991)
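The re-estimated counts in Figure 6.10 follow directly from Equation 6.27 applied to the N_c column. The short sketch below recomputes them; the dictionary and function names are illustrative assumptions, and the last digit can differ slightly from the printed figure depending on rounding.

```python
# Frequencies of frequencies N_c for the AP bigrams, copied from Figure 6.10.
freq_of_freq = {
    0: 74_671_100_000, 1: 2_018_046, 2: 449_721, 3: 188_933, 4: 105_668,
    5: 68_379, 6: 48_190, 7: 35_709, 8: 27_710, 9: 22_280,
}

def good_turing_count(c, n_c=freq_of_freq):
    """Good-Turing re-estimate c* = (c + 1) * N_{c+1} / N_c (Equation 6.27)."""
    return (c + 1) * n_c[c + 1] / n_c[c]

# c* can be computed for c = 0..8 here; c = 9 would need N_10, which is not in the table.
gt_counts = {c: good_turing_count(c) for c in range(9)}
```

For instance, `gt_counts[0]` is N_1 / N_0: the singletons' total count is spread over all the bigrams that were never seen.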


Church et al. (1991) show that the Good-Turing estimate relies on the
assumption that the distribution of each bigram is binomial. The estimate
also assumes we know N_0, the number of bigrams we haven’t seen. We
know this because given a vocabulary size of V, the total number of bigrams
is V^2. (N_0 is V^2 minus all the bigrams we have seen.)

In practice, this discounted estimate c* is not used for all counts c.
Large counts (where c > k for some threshold k) are assumed to be reliable.
Katz (1987) suggests setting k at 5. Thus we define

    c^* = c  \text{ for } c > k    (6.28)


The  correct  equation  for  c*  when  some  k is  introduced  (from  Katz 
(1987))  is: 


    c^* = \frac{(c + 1) \frac{N_{c+1}}{N_c} - c \frac{(k + 1) N_{k+1}}{N_1}}
               {1 - \frac{(k + 1) N_{k+1}}{N_1}}  \text{ for } 1 \le c \le k    (6.29)
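Equations 6.28 and 6.29 can be sketched together, again using the N_c values from Figure 6.10 and Katz's suggested k = 5. The function name is an assumption of this sketch. A useful property to check is that the total count removed from the bigrams with 1 <= c <= k equals N_1, the mass that Good-Turing assigns to unseen bigrams.

```python
import math

# Frequencies of frequencies N_c for the AP bigrams, copied from Figure 6.10.
freq_of_freq = {
    0: 74_671_100_000, 1: 2_018_046, 2: 449_721, 3: 188_933, 4: 105_668,
    5: 68_379, 6: 48_190, 7: 35_709, 8: 27_710, 9: 22_280,
}

def katz_count(c, n_c=freq_of_freq, k=5):
    """Discounted count c* with threshold k (Equations 6.28 and 6.29)."""
    if c < 1:
        raise ValueError("Equation 6.29 is defined for c >= 1")
    if c > k:
        return float(c)                        # large counts are trusted as they are
    ratio = (k + 1) * n_c[k + 1] / n_c[1]      # (k+1) N_{k+1} / N_1
    gt = (c + 1) * n_c[c + 1] / n_c[c]         # plain Good-Turing estimate
    return (gt - c * ratio) / (1 - ratio)

katz_counts = {c: katz_count(c) for c in range(1, 10)}
# Total count mass removed from bigrams with 1 <= c <= k.
removed = sum(freq_of_freq[c] * (c - katz_counts[c]) for c in range(1, 6))
```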


With Good-Turing discounting as with any other, it is usual to treat
N-grams with low counts (especially counts of 1) as if the count was 0.


6.4  Backoff 




The discounting we have been discussing so far can help solve the problem of
zero frequency N-grams. But there is an additional source of knowledge we
can draw on. If we have no examples of a particular trigram w_{n-2} w_{n-1} w_n to
help us compute P(w_n | w_{n-2} w_{n-1}), we can estimate its probability by using
the bigram probability P(w_n | w_{n-1}). Similarly, if we don’t have counts to
compute P(w_n | w_{n-1}), we can look to the unigram P(w_n).

There are two ways to rely on this N-gram ‘hierarchy’, deleted
interpolation and backoff. We will focus on backoff, although we give a quick
overview of deleted interpolation after this section. Backoff N-gram
modeling is a nonlinear method introduced by Katz (1987). In the backoff model,
like the deleted interpolation model, we build an N-gram model based on an
(N-1)-gram model. The difference is that in backoff, if we have non-zero
trigram counts, we rely solely on the trigram counts and don’t interpolate
the bigram and unigram counts at all. We only ‘back off’ to a lower-order
N-gram if we have zero evidence for a higher-order N-gram.

The trigram version of backoff might be represented as follows:

    \hat{P}(w_i | w_{i-2} w_{i-1}) =
      \begin{cases}
        P(w_i | w_{i-2} w_{i-1}), & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\
        \alpha_1 P(w_i | w_{i-1}), & \text{if } C(w_{i-2} w_{i-1} w_i) = 0 \text{ and } C(w_{i-1} w_i) > 0 \\
        \alpha_2 P(w_i), & \text{otherwise.}
      \end{cases}    (6.30)

Let’s ignore the α values for a moment; we’ll discuss the need for
these weighting factors below. Here’s a first pass at the (recursive) equation
for representing the general case of this form of backoff.


    \hat{P}(w_n | w_{n-N+1}^{n-1}) = \tilde{P}(w_n | w_{n-N+1}^{n-1})
        + \theta(P(w_n | w_{n-N+1}^{n-1})) \alpha \hat{P}(w_n | w_{n-N+2}^{n-1})    (6.31)

Again, ignore the α and the \tilde{P} for the moment. Following Katz, we’ve
used θ to indicate the binary function that selects a lower-ordered model only
if the higher-order model gives a zero probability:

    \theta(x) = \begin{cases} 1, & \text{if } x = 0 \\ 0, & \text{otherwise.} \end{cases}    (6.32)

and each P(·) is a MLE (i.e. computed directly by dividing counts). The
next section will work through these equations in more detail. In order to do
that, we’ll need to understand the role of the α values and how to compute
them.

Combining  Backoff  with  Discounting 

Our previous discussions of discounting showed how to use a discounting
algorithm to assign probability mass to unseen events. For simplicity, we
assumed that these unseen events were all equally probable, and so the
probability mass got distributed evenly among all unseen events. Now we can
combine discounting with the backoff algorithm we have just seen to be a
little more clever in assigning probability to unseen events. We will use the
discounting algorithm to tell us how much total probability mass to set aside
for all the events we haven’t seen, and the backoff algorithm to tell us how
to distribute this probability in a clever way.

First, the reader should stop and answer the following question (don’t
look ahead): Why did we need the α values in Equation 6.30 (or
Equation 6.31)? Why couldn’t we just have three sets of probabilities without
weights?

The answer: without α values, the result of the equation would not be
a true probability! This is because the original P(w_n | w_{n-N+1}^{n-1}) we got from
relative frequencies were true probabilities, i.e. if we sum the probabilities of
all words w_n in a given N-gram context, we should get 1:

    \sum_{w_n} P(w_n | w_i w_j) = 1    (6.33)

But if that is the case, if we back off to a lower order model when the
probability is zero, we are adding extra probability mass into the equation,
and the total probability over all words in that context will be greater than 1!
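The problem is easy to reproduce. The sketch below builds unsmoothed bigram and unigram estimates from a hypothetical toy corpus and backs off with no discounting and no α weights; the corpus and the function name `naive_backoff` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy corpus.
tokens = "i want to eat i want chinese food i want to eat lunch".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])   # how often each word is followed by another word
vocab = sorted(unigrams)
n = len(tokens)

def naive_backoff(word, history):
    """MLE bigram if the bigram was seen, otherwise the raw MLE unigram (no discount, no alpha)."""
    if bigrams[(history, word)] > 0:
        return bigrams[(history, word)] / history_counts[history]
    return unigrams[word] / n

# Sum over every word in the vocabulary for the single context "want".
total = sum(naive_backoff(w, "want") for w in vocab)
```

The seen bigrams after "want" already account for all of the probability mass, so every unigram probability added for an unseen continuation pushes `total` above 1.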

Thus any backoff language model must also be discounted. This
explains the αs and \tilde{P} in Equation 6.31. The \tilde{P} comes from our need to discount
the MLE probabilities to save some probability mass for the lower-order
N-grams. We will use \tilde{P} to mean discounted probabilities, and save P for plain
old relative frequencies computed directly from counts. The α is used to
ensure that the probability mass from all the lower order N-grams sums up to
exactly the amount that we saved by discounting the higher-order N-grams.
Here’s the correct final equation:

    \hat{P}(w_n | w_{n-N+1}^{n-1}) = \tilde{P}(w_n | w_{n-N+1}^{n-1})
        + \theta(P(w_n | w_{n-N+1}^{n-1})) \alpha(w_{n-N+1}^{n-1}) \hat{P}(w_n | w_{n-N+2}^{n-1})    (6.34)


Now let’s see the formal definition of each of these components of the
equation. We define \tilde{P} as the discounted (c*) MLE estimate of the conditional
probability of an N-gram, as follows:

    \tilde{P}(w_n | w_{n-N+1}^{n-1}) = \frac{c^*(w_{n-N+1}^{n})}{c(w_{n-N+1}^{n-1})}    (6.35)


This probability \tilde{P} will be slightly less than the MLE estimate
c(w_{n-N+1}^{n}) / c(w_{n-N+1}^{n-1})
(i.e. on average the c* will be less than c). This will leave some probability
mass for the lower order N-grams. Now we need to build the α weighting
we’ll need for passing this mass to the lower-order N-grams. Let’s represent
the total amount of left-over probability mass by the function β, a function of
the (N-1)-gram context. For a given (N-1)-gram context, the total left-over
probability mass can be computed by subtracting from 1 the total discounted
probability mass for all N-grams starting with that context:

    \beta(w_{n-N+1}^{n-1}) = 1 - \sum_{w_n : c(w_{n-N+1}^{n}) > 0} \tilde{P}(w_n | w_{n-N+1}^{n-1})    (6.36)


This gives us the total probability mass that we are ready to distribute
to all (N-1)-grams (e.g. bigrams if our original model was a trigram). Each
individual (N-1)-gram (bigram) will only get a fraction of this mass, so we
need to normalize β by the total probability of all the (N-1)-grams (bigrams)
that begin some N-gram (trigram). The final equation for computing how
much probability mass to distribute from an N-gram to an (N-1)-gram is
represented by the function α:

    \alpha(w_{n-N+1}^{n-1}) = \frac{1 - \sum_{w_n : c(w_{n-N+1}^{n}) > 0} \tilde{P}(w_n | w_{n-N+1}^{n-1})}
                                   {1 - \sum_{w_n : c(w_{n-N+1}^{n}) > 0} \tilde{P}(w_n | w_{n-N+2}^{n-1})}    (6.37)

Note that α is a function of the preceding word string, i.e. of w_{n-N+1}^{n-1};
thus the amount by which we discount each trigram (d), and the mass that
gets reassigned to lower-order N-grams (α) are recomputed for every
N-gram (more accurately for every (N-1)-gram that occurs in any N-gram).


We only need to specify what to do when the counts of an (N-1)-gram
context are 0 (i.e. when c(w_{n-N+1}^{n-1}) = 0) and our definition is complete:

    \hat{P}(w_n | w_{n-N+1}^{n-1}) = \hat{P}(w_n | w_{n-N+2}^{n-1})    (6.38)

and

    \tilde{P}(w_n | w_{n-N+1}^{n-1}) = 0    (6.39)

and

    \beta(w_{n-N+1}^{n-1}) = 1    (6.40)

In Equation 6.35, the discounted probability \tilde{P} can be computed with
the discounted counts c* from the Witten-Bell discounting (Equation 6.20)
or with the Good-Turing discounting discussed above.


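Here is the backoff model expressed in a slightly clearer format in its
trigram version:

    \hat{P}(w_i | w_{i-2} w_{i-1}) =
      \begin{cases}
        \tilde{P}(w_i | w_{i-2} w_{i-1}), & \text{if } C(w_{i-2} w_{i-1} w_i) > 0 \\
        \alpha(w_{i-2}^{i-1}) \tilde{P}(w_i | w_{i-1}), & \text{if } C(w_{i-2} w_{i-1} w_i) = 0 \text{ and } C(w_{i-1} w_i) > 0 \\
        \alpha(w_{i-1}) \tilde{P}(w_i), & \text{otherwise.}
      \end{cases}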
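To make the roles of \tilde{P}, β and α concrete, here is a minimal sketch of a bigram-to-unigram backoff model. Several choices are assumptions of the sketch rather than the text's prescription: it uses a Witten-Bell style discount c / (N(h) + T(h)) for \tilde{P}, leaves the unigram level as a plain relative frequency over a closed vocabulary, and uses a hypothetical toy corpus and class name. It also assumes no history has been seen followed by every word in the vocabulary, since the denominator of α would then be zero.

```python
from collections import Counter, defaultdict

class BackoffBigram:
    """Bigram model that backs off to unigrams, in the style of Equations 6.34-6.40."""

    def __init__(self, tokens):
        self.unigrams = Counter(tokens)
        self.n = len(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.followers = defaultdict(set)   # word types seen after each history
        self.hist_total = Counter()         # bigram tokens starting with each history
        for (h, w), c in self.bigrams.items():
            self.followers[h].add(w)
            self.hist_total[h] += c

    def p_unigram(self, w):
        return self.unigrams[w] / self.n

    def p_discounted(self, w, h):
        # Discounted bigram estimate (Witten-Bell style): c / (N(h) + T(h)).
        return self.bigrams[(h, w)] / (self.hist_total[h] + len(self.followers[h]))

    def beta(self, h):
        # Left-over mass for context h (Equation 6.36); 1 if h was never seen (Equation 6.40).
        if self.hist_total[h] == 0:
            return 1.0
        return 1.0 - sum(self.p_discounted(w, h) for w in self.followers[h])

    def alpha(self, h):
        # Normalize beta by the lower-order mass of the words NOT seen after h (Equation 6.37).
        seen_lower = sum(self.p_unigram(w) for w in self.followers[h])
        return self.beta(h) / (1.0 - seen_lower)

    def prob(self, w, h):
        if self.bigrams[(h, w)] > 0:
            return self.p_discounted(w, h)
        return self.alpha(h) * self.p_unigram(w)

tokens = "i want to eat i want chinese food i want to eat lunch".split()
model = BackoffBigram(tokens)
p_seen = model.prob("to", "want")      # discounted bigram estimate
p_backed = model.prob("food", "want")  # unseen bigram: alpha(want) * P(food)
```

Unlike unweighted backoff, the probabilities for each history now sum to 1 over the vocabulary, because α scales the unigram mass of the unseen continuations to exactly β.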


In practice, when discounting, we usually ignore counts of 1, i.e. we
treat N-grams with a count of 1 as if they never occurred.

Gupta et al. (1992) present a variant backoff method of assigning
probabilities to zero trigrams.


6.5  Deleted  Interpolation 


The deleted interpolation algorithm, due to Jelinek and Mercer (1980),
combines different N-gram orders by linearly interpolating all three models
whenever we are computing any trigram. That is, we estimate the probability
P(w_n | w_{n-2} w_{n-1}) by mixing together the unigram, bigram, and trigram
probabilities. Each of these is weighted by a linear weight λ:

    \hat{P}(w_n | w_{n-2} w_{n-1}) = \lambda_1 P(w_n | w_{n-2} w_{n-1})
                                   + \lambda_2 P(w_n | w_{n-1})
                                   + \lambda_3 P(w_n)    (6.41)

such that the λs sum to 1:

    \sum_i \lambda_i = 1    (6.42)


In practice, in this deleted interpolation algorithm we don’t train just
three $\lambda$s for a trigram grammar. Instead, we make each $\lambda$ a
function of the context. This way, if we have particularly accurate counts
for a particular bigram, we assume that the counts of the trigrams based on
this bigram will be more trustworthy, and so we can make the $\lambda$s for
those trigrams higher and thus give that trigram more weight in the
interpolation. So a more detailed version of the interpolation formula
would be:

$$
\begin{aligned}
\hat{P}(w_n \mid w_{n-2}w_{n-1}) = {} & \lambda_1(w_{n-2}^{n-1})\, P(w_n \mid w_{n-2}w_{n-1}) \\
& + \lambda_2(w_{n-2}^{n-1})\, P(w_n \mid w_{n-1}) \\
& + \lambda_3(w_{n-2}^{n-1})\, P(w_n)
\end{aligned}
\qquad (6.43)
$$

Given the $P(\cdot)$ values, the $\lambda$ values are trained so as to
maximize the likelihood of a held-out corpus separate from the main
training corpus, using a version of the EM algorithm defined in Chapter 7
(Baum, 1972; Dempster et al., 1977; Jelinek and Mercer, 1980). Further
details of the algorithm are described in Bahl et al. (1983).
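The training of the interpolation weights can be illustrated with a small Python sketch. It makes a simplifying assumption that the text does not: it trains just three context-independent weights rather than weights that are functions of the context. The component models are passed in as hypothetical functions `p_tri`, `p_bi`, and `p_uni`, which are assumed to have been estimated on the main training corpus; `heldout` is a list of trigram tokens from the held-out corpus.

```python
def train_lambdas(heldout, p_tri, p_bi, p_uni, iterations=20):
    """EM for three context-independent interpolation weights.

    heldout : list of (w2, w1, w) trigram tokens from held-out data
    p_tri, p_bi, p_uni : functions returning the fixed component
        probabilities P(w | w2 w1), P(w | w1), and P(w)
    """
    lambdas = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(iterations):
        expected = [0.0, 0.0, 0.0]
        for w2, w1, w in heldout:
            parts = [
                lambdas[0] * p_tri(w2, w1, w),
                lambdas[1] * p_bi(w1, w),
                lambdas[2] * p_uni(w),
            ]
            total = sum(parts)
            if total == 0.0:
                continue  # token impossible under every component
            # E-step: fractional credit of each component for this token
            for k in range(3):
                expected[k] += parts[k] / total
        norm = sum(expected)
        if norm == 0.0:
            break
        # M-step: renormalize the expected counts into new weights
        lambdas = [e / norm for e in expected]
    return lambdas


def interpolated_prob(w2, w1, w, lambdas, p_tri, p_bi, p_uni):
    """Linear interpolation of the three component estimates."""
    return (lambdas[0] * p_tri(w2, w1, w)
            + lambdas[1] * p_bi(w1, w)
            + lambdas[2] * p_uni(w))
```

Making each weight depend on the context, as the text describes, would amount to running the same updates separately for each group of contexts; how contexts are grouped is a design choice this sketch leaves open.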


6.6  N-grams for Spelling and Pronunciation

In Chapter 5 we saw the use of the Bayesian/noisy-channel algorithm for
correcting spelling errors and for picking a word given a surface
pronunciation. We saw that both these algorithms failed, returning the
wrong word, because they had no way to model the probability of
multiple-word strings. Now that our N-grams give us such a model, we return
to these two problems.

Context-Sensitive Spelling Error Correction

Chapter 5 introduced the idea of detecting spelling errors by looking for
words that are not in a dictionary, are not generated by some finite-state
model of English word-formation, or have low probability orthotactics. But
none of these techniques is sufficient to detect and correct real-word
spelling errors. This is the class of errors that result in an actual word
of English. This can happen from typographical errors (insertion, deletion,
transposition) that accidentally produce a real word (e.g., there for
three), or because the writer substituted the wrong spelling of a homophone
or near-homophone (e.g., dessert for desert, or piece for peace). The task
of correcting these errors is called context-sensitive spelling error
correction.

How important are these errors? By an a priori analysis of single
typographical errors (single insertions, deletions, substitutions, or
transpositions) Peterson (1986) estimates that 15% of such spelling errors
produce valid English words (given a very large list of 350,000 words).
Kukich (1992) summarizes a number of other analyses based on empirical
studies of corpora, which give figures between 25% and 40% for the
percentage of errors that are valid English words. Figure 6.11 gives some
examples from Kukich (1992), broken down into local and global errors.
Local errors are those that are probably detectable from the immediate
surrounding words, while global errors are ones in which error detection
requires examination of a large context.

One method for context-sensitive spelling error correction is based on
N-grams.

The word N-gram approach to spelling error detection and correction was
proposed by Mays et al. (1991). The idea is to generate every possible
misspelling of each word in a sentence, either just by typographical
modifications (letter insertion, deletion, substitution) or by including
homophones as well (and presumably including the correct spelling), and
then to choose the spelling that gives the sentence the highest prior
probability. That is, given a sentence
$W = \{w_1, w_2, \ldots, w_k, \ldots, w_n\}$, where $w_k$ has alternative
spellings $w'_k$, $w''_k$, etc., we choose the spelling among these
possible spellings that maximizes P(W), using the N-gram grammar to compute
P(W). A class-based N-gram can be used instead, which can find unlikely
part-of-speech combinations, although it may not do as well at finding
unlikely word combinations.
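The word N-gram approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate spellings come from a hypothetical dictionary `alternatives` of confusion sets, the language model is passed in as a hypothetical function `bigram_logprob(prev, word)`, and every combination of candidates is scored exhaustively, which is exponential in sentence length and practical only for short sentences or few alternatives.

```python
import itertools


def sentence_logprob(words, bigram_logprob):
    """Sum of bigram log probabilities, starting from a <s> token."""
    prev = "<s>"
    total = 0.0
    for w in words:
        total += bigram_logprob(prev, w)
        prev = w
    return total


def correct_sentence(words, alternatives, bigram_logprob):
    """Return the candidate word sequence with the highest P(W).

    alternatives : dict mapping a word to a list of other real words
        it may have been confused with (hypothetical confusion sets)
    """
    # Each position keeps the word as typed plus its alternatives.
    candidate_sets = [[w] + alternatives.get(w, []) for w in words]
    best = max(
        itertools.product(*candidate_sets),
        key=lambda cand: sentence_logprob(cand, bigram_logprob),
    )
    return list(best)
```

For the local error "fifteen minuets" from Figure 6.11, a model that assigns a higher probability to "fifteen minutes" than to "fifteen minuets" would select "minutes".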

There are many other statistical approaches to context-sensitive spelling
error correction, some proposed directly for spelling, others for more
general types of lexical disambiguation (such as word-sense disambiguation
or accent restoration). Besides the trigram approach we have just
described, these include Bayesian classifiers, alone or combined with
trigrams (Gale et al., 1993; Golding, 1997; Golding and Schabes, 1996),
decision lists (Yarowsky, 1994), transformation-based learning (Mangu and
Brill, 1997), latent semantic analysis (Jones and Martin, 1997), and Winnow
(Golding and Roth, 1999). In a comparison of these, Golding and Roth (1999)
found that the Winnow algorithm gave the best performance. In general,
however, these algorithms are very similar in many ways; they are all based
on features like word and part-of-speech N-grams, and Roth (1998, 1999)
shows that many of them make their predictions using a family of linear
predictors called Linear Statistical Queries (LSQ) hypotheses. Chapter 17
will define all these algorithms and discuss these issues further in the
context of word-sense disambiguation.

Local Errors

The study was conducted mainly be John Black.
They are leaving in about fifteen minuets to go to her house.
The design an construction of the system will take more than a year.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of [this problem.]
He need to go there right no w.
He is trying to fine out.

Global Errors

Won’t they heave if next Monday at that time?
This thesis is supported by the fact that since 1989 the system
has been operating system with all four units on-line, but . . .

Figure 6.11  Some attested real-word spelling errors from Kukich (1992),
broken down into local and global errors.

N-grams for Pronunciation Modeling

The N-gram model can also be used to get better performance on the
words-from-pronunciation task that we studied in Chapter 5. Recall that the
input was the pronunciation [n iy] following the word I. We said that the
five words that could be pronounced [n iy] were need, new, neat, the, and
knee. The algorithm in Chapter 5 was based on the product of the unigram
probability of each word and the pronunciation likelihood, and incorrectly
chose the word new, based mainly on its high unigram probability.

Adding  a simple  bigram  probability,  even  without  proper  smoothing,  is 
enough  to  solve  this  problem  correctly.  In  the  following  table  we  fix  the  table 
on  page  165  by  using  a bigram  rather  than  unigram  word  probability  p(w) 
for  each  of  the  five  candidate  words  (given  that  the  word  I occurs  64,736 
times  in  the  combined  Brown  and  Switchboard  corpora): 

Word    C(I w)    C(I w)+0.5    p(w|I)
need    153       153.5         .0016
new     0         0.5           .000005
knee    0         0.5           .000005
the     17        17.5          .00018
neat    0         0.5           .000005

Incorporating this new word probability into the combined model, we find
that it now predicts the correct word need, as the table below shows:


Word    p(y|w)    p(w)       p(y|w)p(w)
need    .11       .0016      .00018
knee    1.00      .000005    .000005
neat    .52       .000005    .0000026
new     .36       .000005    .0000018
the     0         .00018     0
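The ranking in this table can be reproduced with a short Python sketch. The pronunciation likelihoods and bigram probabilities are copied from the tables in this section; the variable and function names are illustrative.

```python
# p(y|w): pronunciation likelihoods, copied from the table.
pron_likelihood = {
    "need": 0.11, "knee": 1.00, "neat": 0.52, "new": 0.36, "the": 0.0,
}
# p(w|I): bigram probabilities following the word "I", from the table.
bigram_prob = {
    "need": 0.0016, "knee": 0.000005, "neat": 0.000005,
    "new": 0.000005, "the": 0.00018,
}


def rank_candidates(likelihood, prior):
    """Sort candidate words by p(y|w) * p(w), best first."""
    scores = {w: likelihood[w] * prior[w] for w in likelihood}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(pron_likelihood, bigram_prob)
best_word = ranking[0][0]
```

Here `best_word` is need, and the remaining candidates follow in the order shown in the table.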

6.7  Entropy 

I got  the  horse  right  here 

Frank  Loesser,  Guys  and  Dolls 

Entropy and perplexity are the most common metrics used to evaluate
N-gram systems. The next sections summarize a few necessary fundamental
facts about information theory and then introduce the entropy and
perplexity metrics. We strongly suggest that the interested reader consult
a good information theory textbook; Cover and Thomas (1991) is one
excellent example.

Entropy is a measure of information, and is invaluable in natural language
processing, speech recognition, and computational linguistics. It can be
used as a metric for how much information there is in a particular grammar,
for how well a given grammar matches a given language, for how predictive a
given N-gram grammar is about what the next word could be. Given two
grammars and a corpus, we can use entropy to tell us which grammar better
matches the corpus. We can also use entropy to compare how difficult two
speech recognition tasks are, and also to measure how well a given
probabilistic grammar matches human grammars.

Computing entropy requires that we establish a random variable X that
ranges over whatever we are predicting (words, letters, parts of speech,
the set of which we’ll call $\chi$), and that has a particular probability
function, call it p(x). The entropy of this random variable X is then

$$
H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x) \qquad (6.44)
$$

The  log  can  in  principle  be  computed  in  any  base;  recall  that  we  use  log 
base  2 in  all  calculations  in  this  book.  The  result  of  this  is  that  the  entropy  is 
measured  in  bits. 

The  most  intuitive  way  to  define  entropy  for  computer  scientists  is  to 
think  of  the  entropy  as  a lower  bound  on  the  number  of  bits  it  would  take 
to  encode  a certain  decision  or  piece  of  information  in  the  optimal  coding 
scheme. 

Cover  and  Thomas  (1991)  suggest  the  following  example.  Imagine 
that  we  want  to  place  a bet  on  a horse  race  but  it  is  too  far  to  go  all  the  way 
to  Yonkers  Racetrack,  and  we’d  like  to  send  a short  message  to  the  bookie 
to  tell  him  which  horse  to  bet  on.  Suppose  there  are  eight  horses  in  this 
particular  race. 

One way to encode this message is just to use the binary representation
of the horse’s number as the code; thus horse 1 would be 001, horse 2 010,
horse 3 011, and so on, with horse 8 coded as 000. If we spend the whole
day betting, and each horse is coded with 3 bits, on the average we would
be sending 3 bits per race.

Can  we  do  better?  Suppose  that  the  spread  is  the  actual  distribution  of 
the  bets  placed,  and  that  we  represent  it  as  the  prior  probability  of  each  horse 
as  follows: 


Horse 1    1/2     Horse 5    1/64
Horse 2    1/4     Horse 6    1/64
Horse 3    1/8     Horse 7    1/64
Horse 4    1/16    Horse 8    1/64

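The entropy of the random variable X that ranges over horses gives us
a lower bound on the number of bits, and is:

$$
\begin{aligned}
H(X) &= -\sum_{i=1}^{8} p(i) \log p(i) \\
&= -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{16}\log\tfrac{1}{16} - 4\left(\tfrac{1}{64}\log\tfrac{1}{64}\right) \\
&= 2 \text{ bits} \qquad (6.45)
\end{aligned}
$$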

A code that averages 2 bits per race can be built by using short
encodings for more probable horses, and longer encodings for less probable
horses. For example, we could encode the most likely horse with the code
0, and the remaining horses as 10, then 110, 1110, 111100, 111101,
111110, and 111111.
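The 2-bit figure can be checked numerically. The Python sketch below assumes the horse probabilities 1/2, 1/4, 1/8, 1/16, and 1/64 for each of the last four horses (the values of the Cover and Thomas example) and the variable-length code just described; the function names are illustrative.

```python
import math


def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_code_length(probs, codewords):
    """Average number of bits sent per race under the given code."""
    return sum(p * len(c) for p, c in zip(probs, codewords))


def is_prefix_free(codewords):
    """True if no codeword is a proper prefix of another codeword."""
    return not any(
        a != b and b.startswith(a) for a in codewords for b in codewords
    )


horse_probs = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]
```

Both `entropy(horse_probs)` and `expected_code_length(horse_probs, codes)` come out at 2 bits, so this code meets the lower bound, and because the code is prefix-free the bookie can decode a stream of bets without separators.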

What if the horses are equally likely? We saw above that if we use an
equal-length binary code for the horse numbers, each horse took 3 bits to
code, and so the average was 3. Is the entropy the same? In this case each
horse would have a probability of 1/8. The entropy of the choice of horses
is then:

$$
H(X) = -\sum_{i=1}^{8} \tfrac{1}{8}\log\tfrac{1}{8} = -\log\tfrac{1}{8} = 3 \text{ bits} \qquad (6.46)
$$

The value $2^H$ is called the perplexity (Jelinek et al., 1977; Bahl et
al., 1983). Perplexity can be intuitively thought of as the weighted
average number of choices a random variable has to make. Thus choosing
between 8 equally likely horses (where H = 3 bits), the perplexity is
$2^3$ or 8. Choosing between the biased horses in the table above (where
H = 2 bits), the perplexity is $2^2$ or 4.
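As a minimal sketch of the relation between entropy and perplexity, the Python function below computes $2^H$ directly from a distribution. The two distributions are the uniform and biased horse races discussed in this section, with the biased probabilities assumed to be 1/2, 1/4, 1/8, 1/16, and four times 1/64.

```python
import math


def perplexity(probs):
    """2 ** H for a discrete distribution, with H measured in bits."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** h


uniform_horses = [1 / 8] * 8
biased_horses = [1 / 2, 1 / 4, 1 / 8, 1 / 16] + [1 / 64] * 4
```

The uniform race has perplexity 8 and the biased race perplexity 4: the bias makes the eight-way choice about as hard as a choice among four equally likely horses.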

Until now we have been computing the entropy of a single variable. But
most of what we will use entropy for involves sequences; for a grammar, for
example, we will be computing the entropy of some sequence of words
$W = \{\ldots w_0, w_1, w_2, \ldots, w_n\}$. One way to do this is to have a
variable that ranges over sequences of words. For example we can compute
the entropy of a random variable that ranges over all finite sequences of
words of length n in some language L as follows:

$$
H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n) \qquad (6.47)
$$

We could define the entropy rate (we could also think of this as the
per-word entropy) as the entropy of this sequence divided by the number
of words:

$$
\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n) \qquad (6.48)
$$

But to measure the true entropy of a language, we need to consider
sequences of infinite length. If we think of a language as a stochastic
process L that produces a sequence of words, its entropy rate H(L) is
defined as:

$$
\begin{aligned}
H(L) &= \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) \\
&= -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n) \qquad (6.49)
\end{aligned}
$$

The Shannon-McMillan-Breiman theorem (Algoet and Cover, 1988; Cover and
Thomas, 1991) states that if the language is regular in certain ways (to
be exact, if it is both stationary and ergodic),

$$
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n) \qquad (6.50)
$$

That is, we can take a single sequence that is long enough instead of
summing over all possible sequences. The intuition of the
Shannon-McMillan-Breiman theorem is that a long enough sequence of words
will contain in it many other shorter sequences, and that each of these
shorter sequences will reoccur in the longer sequence according to their
probabilities.
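The theorem can be checked numerically on a small source whose entropy rate is known in closed form. The Python sketch below assumes a hypothetical two-word bigram (Markov) source with made-up transition probabilities; it samples one long sequence and compares the sequence's average negative log probability with the analytic entropy rate of the chain.

```python
import math
import random

# A hypothetical stationary, ergodic two-word bigram source.
TRANSITIONS = {
    "a": {"a": 0.9, "b": 0.1},
    "b": {"a": 0.5, "b": 0.5},
}
# Stationary distribution of the chain above (pi = pi * TRANSITIONS).
STATIONARY = {"a": 5 / 6, "b": 1 / 6}


def true_entropy_rate():
    """Entropy rate in bits: sum_i pi_i * H(next word | word i)."""
    rate = 0.0
    for state, pi in STATIONARY.items():
        for p in TRANSITIONS[state].values():
            rate -= pi * p * math.log2(p)
    return rate


def sample_sequence(n, rng):
    """Draw n words from the chain, starting from the stationary law."""
    words = rng.choices(list(STATIONARY), weights=list(STATIONARY.values()))
    while len(words) < n:
        nxt = TRANSITIONS[words[-1]]
        words.append(rng.choices(list(nxt), weights=list(nxt.values()))[0])
    return words


def per_word_log_prob(words):
    """-(1/n) log2 p(w_1 ... w_n) under the same chain."""
    logp = math.log2(STATIONARY[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log2(TRANSITIONS[prev][cur])
    return -logp / len(words)
```

For a sample of a few hundred thousand words, `per_word_log_prob` should land close to `true_entropy_rate()`, which is roughly 0.56 bits per word for these made-up probabilities.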

A stochastic process is said to be stationary if the probabilities it
assigns to a sequence are invariant with respect to shifts in the time
index. In other words, the probability distribution for words at time t is
the same as the probability distribution at time t + 1. Markov models, and
hence N-grams, are stationary. For example, in a bigram, $P_i$ is
dependent only on $P_{i-1}$. So if we shift our time index by x, $P_{i+x}$
is still dependent on $P_{i+x-1}$. But natural language is not stationary,
since as we will see in Chapter 9, the probability of upcoming words can be
dependent on events that were arbitrarily distant and time dependent. Thus
our statistical models only give an approximation to the correct
distributions and entropies of natural language.

To summarize, by making some incorrect but convenient simplifying
assumptions, we can compute the entropy of some stochastic process by
taking a very long sample of the output, and computing its average log
probability. In the next section we talk about the why and how: why we
would want to do this (i.e., for what kinds of problems would the entropy
tell us something useful), and how to compute the probability of a very
long sequence.

Cross  Entropy  for  Comparing  Models 

In this section we introduce the cross entropy, and discuss its usefulness
in comparing different probabilistic models. The cross entropy is useful
when we don’t know the actual probability distribution p that generated
some data. It allows us to use some m, which is a model of p (i.e., an
approximation to p). The cross-entropy of m on p is defined by:

$$
H(p,m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n) \qquad (6.51)
$$

That is, we draw sequences according to the probability distribution p,
but sum the log of their probability according to m.

Again, following the Shannon-McMillan-Breiman theorem, for a stationary
ergodic process:

$$
H(p,m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n) \qquad (6.52)
$$

What makes the cross entropy useful is that the cross entropy H(p,m)
is an upper bound on the entropy H(p). For any model m:

$$
H(p) \le H(p,m) \qquad (6.53)
$$

This means that we can use some simplified model m to help estimate
the true entropy of a sequence of symbols drawn according to probability
p. The more accurate m is, the closer the cross entropy H(p,m) will be to
the true entropy H(p). Thus the difference between H(p,m) and H(p) is
a measure of how accurate a model is. Between two models $m_1$ and $m_2$,
the more accurate model will be the one with the lower cross-entropy. (The
cross-entropy can never be lower than the true entropy, so a model cannot
err by underestimating the true entropy.)
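The upper-bound property is easy to see on a single random variable, which is the one-symbol analogue of the sequence definitions above. The Python sketch below uses made-up distributions: p plays the role of the true distribution, and `m1` and `m2` are two hypothetical models of it.

```python
import math


def entropy(p):
    """H(p) in bits for a dict mapping outcomes to probabilities."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)


def cross_entropy(p, m):
    """H(p, m) = -sum_x p(x) log2 m(x), in bits.

    Assumes m assigns nonzero probability wherever p does.
    """
    return -sum(px * math.log2(m[x]) for x, px in p.items() if px > 0)


p = {"a": 0.5, "b": 0.25, "c": 0.25}      # "true" distribution (made up)
m1 = {"a": 0.4, "b": 0.3, "c": 0.3}       # a fairly close model
m2 = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # a cruder, uniform model
```

Here H(p) is 1.5 bits, while `cross_entropy(p, m1)` is about 1.53 bits and `cross_entropy(p, m2)` about 1.58 bits, so `m1` is the more accurate model; the cross entropy equals the entropy only when the model is p itself.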

The  Entropy  of  English 

As  we  suggested  in  the  previous  section,  the  cross-entropy  of  some  model 
m can  be  used  as  an  upper  bound  on  the  true  entropy  of  some  process.  We 
can  use  this  method  to  get  an  estimate  of  the  true  entropy  of  English.  Why 
should  we  care  about  the  entropy  of  English? 

One reason is that the true entropy of English would give us a solid
lower bound for all of our future experiments on probabilistic grammars.
Another is that we can use the entropy values for English to help
understand what parts of a language provide the most information (for
example, is the predictability of English mainly based on word order, on
semantics, on morphology, on constituency, or on pragmatic cues?). This can
help us immensely in knowing where to focus our language-modeling efforts.

There are two common methods for computing the entropy of English.
The first was employed by Shannon (1951), as part of his groundbreaking
work in defining the field of information theory. His idea was to use human
subjects, and to construct a psychological experiment that requires them to
guess strings of letters; by looking at how many guesses it takes them to
guess letters correctly we can estimate the probability of the letters, and
hence the entropy of the sequence.


Methodology Box: Perplexity


The methodology box on page 202 mentioned the idea of computing the
perplexity of a test set as a way of comparing two probabilistic models.
(Despite the risk of ambiguity, we will follow the speech and language
processing literature in using the term ‘perplexity’ rather than the more
technically correct term ‘cross-perplexity’.) Here’s an example of
perplexity computation as part of a ‘business news dictation system’. We
trained unigram, bigram, and trigram Katz-style backoff grammars with
Good-Turing discounting on 38 million words (including start-of-sentence
tokens) from the Wall Street Journal (from the WSJ0 corpus (LDC, 1993)).
We used a vocabulary of 19,979 words (i.e., the rest of the word types
were mapped to the unknown word token <UNK> in both training and testing).
We then computed the perplexity of each of these models on a test set of
1.5 million words (where the perplexity is defined as $2^{H(p,m)}$). The
table below shows the perplexity of a 1.5 million word WSJ test set
according to each of these grammars.

N-gram order    Perplexity
Unigram         962
Bigram          170
Trigram         109

In computing perplexities the model m must be constructed without any
knowledge of the test set t. Any kind of knowledge of the test set can
cause the perplexity to be artificially low. For example, sometimes instead
of mapping all unknown words to the <UNK> token, we use a closed-vocabulary
test set in which we know in advance what the set of words is. This can
greatly reduce the perplexity. As long as this knowledge is provided
equally to each of the models we are comparing, the closed-vocabulary
perplexity is still a useful metric for comparing models. But this
cross-perplexity is no longer guaranteed to be greater than the true
perplexity of the test set, and so great care must be taken in interpreting
the results. In general, the perplexities of two language models are only
comparable if they use the same vocabulary.

The actual experiment is designed as follows: we present a subject
with some English text and ask the subject to guess the next letter. The
subjects will use their knowledge of the language to guess the most
probable letter first, the next most probable next, etc. We record the
number of guesses it takes for the subject to guess correctly. Shannon’s
insight was that the entropy of the number-of-guesses sequence is the same
as the entropy of English. (The intuition is that given the
number-of-guesses sequence, we could reconstruct the original text by
choosing the “nth most probable” letter whenever the subject took n
guesses.) This methodology requires the use of letter guesses rather than
word guesses (since the subject sometimes has to do an exhaustive search of
all the possible letters!), and so Shannon computed the per-letter entropy
of English rather than the per-word entropy. He reported an entropy of 1.3
bits (for 27 characters (26 letters plus space)). Shannon’s estimate is
likely to be too low, since it is based on a single text (Jefferson the
Virginian by Dumas Malone). Shannon notes that his subjects had worse
guesses (hence higher entropies) on other texts (newspaper writing,
scientific work, and poetry). More recently variations on the Shannon
experiments include the use of a gambling paradigm where the subjects get
to bet on the next letter (Cover and King, 1978; Cover and Thomas, 1991).
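The reconstruction intuition behind the guessing experiment can be illustrated with a simulated subject. The Python sketch below is an assumption-laden stand-in, not Shannon's procedure: the human subject is replaced by a letter-bigram ranker trained on the text itself, and the entropy reported is simply that of the observed guess-number frequencies. All names are illustrative, and the text is assumed to contain only lowercase letters and spaces.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus space


def train_ranker(text):
    """Return a function giving the guess order after a context letter."""
    follow = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        follow[prev][cur] += 1

    def ranked(prev):
        counts = follow[prev]
        # Most frequent continuation first; ties broken alphabetically.
        return sorted(ALPHABET, key=lambda ch: (-counts[ch], ch))

    return ranked


def guess_numbers(text, ranked):
    """Guesses the simulated subject needs for each letter after the first."""
    return [ranked(prev).index(cur) + 1 for prev, cur in zip(text, text[1:])]


def reconstruct(first, guesses, ranked):
    """Rebuild the text by taking the g-th ranked letter at each step."""
    out = [first]
    for g in guesses:
        out.append(ranked(out[-1])[g - 1])
    return "".join(out)


def guess_entropy(guesses):
    """Entropy in bits of the observed guess-number frequencies."""
    n = len(guesses)
    return -sum(c / n * math.log2(c / n) for c in Counter(guesses).values())
```

Because the same ranker is used in both directions, `reconstruct` recovers the original text from its first letter and the guess numbers alone, which is why the guess-number sequence carries the same information as the text.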

The  second  method  for  computing  the  entropy  of  English  helps  avoid 
the  single-text  problem  that  confounds  Shannon’s  results.  This  method  is  to 
take  a very  good  stochastic  model,  train  it  on  a very  large  corpus,  and  use 
it  to  assign  a log-probability  to  a very  long  sequence  of  English,  using  the 
Shannon-McMillan-Breiman  theorem: 

$$
H(\text{English}) \le \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n) \qquad (6.54)
$$

For example, Brown et al. (1992) trained a trigram language model on
583 million words of English (293,181 different types) and used it to
compute the probability of the entire Brown corpus (1,014,312 tokens). The
training data include newspapers, encyclopedias, novels, office
correspondence, proceedings of the Canadian parliament, and other
miscellaneous sources.

They then computed the character-entropy of the Brown corpus, by using
their word-trigram grammar to assign probabilities to the Brown corpus,
considered as a sequence of individual letters. They obtained an entropy
of 1.75 bits per character (where the set of characters included all the
95 printable ASCII characters).

The average length of English written words (including space) has been
reported at 5.5 letters (Nadas, 1984). If this is correct, it means that
the Shannon estimate of 1.3 bits per letter corresponds to a per-word
perplexity of 142 for general English. The numbers we report above for the
WSJ experiments are significantly lower since the training and test set
came from the same subsample of English. That is, those experiments
underestimate the complexity of English since the Wall Street Journal looks
very little like Shakespeare.
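As a worked check of the figure of 142 (a sketch of the arithmetic, assuming that the per-word entropy is simply the per-letter entropy multiplied by the average word length):

$$
H_{\text{word}} \approx 5.5 \times 1.3 = 7.15 \text{ bits}, \qquad 2^{7.15} \approx 142
$$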


Bibliographical  and  Historical  Notes 

The underlying mathematics of the N-gram was first proposed by Markov
(1913), who used what are now called simple Markov chains or bigrams to
model sequences of 20,000 vowels and consonants in Pushkin’s Eugene
Onegin. Markov classified each letter as V or C and computed the
probability of occurrence of sequences such as VVV, VCV, CVC, etc. Shannon
(1948) applied N-grams to compute approximations to English word
sequences. Based on Shannon’s work, Markov models were commonly used in
modeling word sequences by the 1950s. In a series of extremely influential
papers starting with Chomsky (1956) and including Chomsky (1957) and
Miller and Chomsky (1963), Noam Chomsky argued that ‘finite-state Markov
processes’, while a possibly useful engineering heuristic, were incapable
of being a complete cognitive model of human grammatical knowledge. These
arguments led many linguists and computational linguists away from
statistical models altogether.

The resurgence of N-gram models came from Jelinek, Mercer, Bahl, and
colleagues at the IBM Thomas J. Watson Research Center, influenced by
Shannon, and Baker at CMU, influenced by the work of Baum and colleagues.
These two labs independently successfully used N-grams in their speech
recognition systems (Jelinek, 1976; Baker, 1975; Bahl et al., 1983). The
Good-Turing algorithm was first applied to the smoothing of N-gram
grammars at IBM by Katz, as cited in Nadas (1984). Jelinek (1990)
summarizes this and many other early language model innovations used in
the IBM language models.

While smoothing had been applied as an engineering solution to the
zero-frequency problem at least as early as Jeffreys (1948) (add-one
smoothing), it is only relatively recently that smoothing received serious
attention. Church and Gale (1991) give a good description of the
Good-Turing method, as well as the proof, and also give a good description
of the Deleted Interpolation method and a new smoothing method. Sampson
(1996) also has a useful discussion of Good-Turing. Problems with the
Add-one algorithm are summarized in Gale and Church (1994). Method C in
Witten and Bell (1991) describes what we called Witten-Bell discounting.
Chen and Goodman (1996) give an empirical comparison of different smoothing
algorithms, including two new methods, average-count and one-count, as well
as Church and Gale’s. Iyer and Ostendorf (1997) discuss a way of smoothing
by adding in data from additional corpora.

Much recent work on language modeling has focused on ways to build
more sophisticated N-grams. These approaches include giving extra weight
to N-grams which have already occurred recently (the cache LM of Kuhn
and de Mori (1990)), choosing long-distance triggers instead of just local
N-grams (Rosenfeld, 1996; Niesler and Woodland, 1999; Zhou and Lua, 1998),
and using variable-length N-grams (Ney et al., 1994; Kneser, 1996; Niesler
and Woodland, 1996). Another class of approaches uses semantic information
to enrich the N-gram, including semantic word associations based on the
latent semantic indexing described in Chapter 15 (Coccaro and Jurafsky,
1998; Bellegarda, 1999), and from on-line dictionaries or thesauri
(Demetriou et al., 1997). Class-based N-grams, based on word classes such
as parts-of-speech, are described in Chapter 8. Language models based on
more structured linguistic knowledge (such as probabilistic parsers) are
described in Chapter 12. Finally, a number of augmentations to N-grams are
based on discourse knowledge, such as using knowledge of the current topic
(Chen et al., 1998; Seymore and Rosenfeld, 1997; Seymore et al., 1998;
Florian and Yarowsky, 1999; Khudanpur and Wu, 1999) or the current speech
act in dialog (see Chapter 19).


6.8  Summary 

This chapter introduced the N-gram, one of the oldest and most broadly
useful practical tools in language processing.

• An N-gram probability is the conditional probability of a word given
the previous N - 1 words. N-gram probabilities can be computed by
simply counting in a corpus and normalizing (the Maximum Likelihood
Estimate) or they can be computed by more sophisticated algorithms.
The advantage of N-grams is that they take advantage of lots of rich
lexical knowledge. A disadvantage for some purposes is that they are
very dependent on the corpus they were trained on.

• Smoothing algorithms provide a better way of estimating the
probability of N-grams which never occur. Commonly-used smoothing
algorithms include backoff or deleted interpolation, with Witten-Bell
or Good-Turing discounting.

• Corpus-based language models like N-grams are evaluated by separating
the corpus into a training set and a test set, training the model
on the training set, and evaluating on the test set. The entropy H, or
more commonly the perplexity $2^H$ (more properly cross-entropy and
cross-perplexity) of a test set are used to compare language models.


Exercises 

6.1  Write out the equation for trigram probability estimation (modifying
Equation 6.11).

6.2  Write out the equation for the discount d = c*/c for add-one
smoothing. Do the same for Witten-Bell smoothing. How do they differ?

6.3  Write  a program  (Perl  is  sufficient)  to  compute  unsmoothed  unigrams 
and  bigrams. 

6.4  Run your N-gram program on two different small corpora of your
choice (you might use email text or newsgroups). Now compare the statistics
of the two corpora. What are the differences in the most common unigrams
between the two? How about interesting differences in bigrams?

6.5  Add  an  option  to  your  program  to  generate  random  sentences. 

6.6  Add  an  option  to  your  program  to  do  Witten-Bell  discounting. 

6.7  Add an option to your program to compute the entropy (or perplexity)
of a test set.

6.8  Suppose someone took all the words in a sentence and reordered them
randomly. Write a program which takes as input such a bag of words and
produces as output a guess at the original order. Use the Viterbi algorithm
and an N-gram grammar produced by your N-gram program (on some corpus).


6.9 The field of authorship attribution is concerned with discovering the
author of a particular text. Authorship attribution is important in many fields,
including history, literature, and forensic linguistics. For example, Mosteller
and Wallace (1964) applied authorship identification techniques to discover
who wrote The Federalist papers. The Federalist papers were written in
1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade
New York to ratify the United States Constitution. They were published
anonymously, and as a result, although some of the 85 essays were
clearly attributable to one author or another, the authorship of 12 was in
dispute between Hamilton and Madison. Foster (1989) applied authorship
identification techniques to suggest that W.S.’s Funeral Elegy for William
Peter was probably written by William Shakespeare, and that the anonymous
author of Primary Colors, the roman à clef about the Clinton campaign for
the American presidency, was journalist Joe Klein (Foster, 1996).

A standard technique for authorship attribution, first used by Mosteller
and Wallace, is a Bayesian approach. For example, they trained a
probabilistic model of the writings of Hamilton, and another model of the writings
of Madison, and computed the maximum-likelihood author for each of the
disputed essays. There are many complex factors that go into these models,
including vocabulary use, word length, syllable structure, rhyme, and grammar;
see Holmes (1994) for a summary. This approach can also be used for
identifying which genre a text comes from.

One factor in many models is the use of rare words. As a simple
approximation to this one factor, apply the Bayesian method to the attribution
of any particular text. You will need three things: a text to test and two
potential authors or genres, each with a large on-line text sample. One of
them should be the correct author. Train a unigram language model on each
of the candidate authors. You are only going to use the singleton unigrams
in each language model. You will compute $P(T|A_1)$, the probability of the
text given author or genre $A_1$, by (1) taking the language model from $A_1$,
(2) multiplying together the probabilities of all the unigrams that occur
only once in the ‘unknown’ text, and (3) taking the geometric mean of these
(i.e., the $n$th root, where $n$ is the number of probabilities you multiplied).
Do the same for $A_2$. Choose whichever is higher. Did it produce the correct
candidate?


HMMS  AND  SPEECH 
RECOGNITION 


When  Frederic  was  a little  lad  he  proved  so  brave  and  daring, 

His  father  thought  he’d  ’prentice  him  to  some  career  seafaring. 

I was,  alas!  his  nurs’rymaid,  and  so  it  fell  to  my  lot 
To  take  and  bind  the  promising  boy  apprentice  to  a pilot  - 
A life  not  bad  for  a hardy  lad,  though  surely  not  a high  lot, 

Though  I'm  a nurse,  you  might  do  worse  than  make  your  boy  a pilot. 
I was  a stupid  nurs’rymaid,  on  breakers  always  steering. 

And I did not catch the word aright, through being hard of hearing;
Mistaking  my  instructions,  which  within  my  brain  did  gyrate, 

I took  and  bound  this  promising  boy  apprentice  to  a pirate. 

The  Pirates  of  Penzance,  Gilbert  and  Sullivan,  1877 


Alas,  this  mistake  by  nurserymaid  Ruth  led  to  Frederic’s  long  indenture  as  a 
pirate  and,  due  to  a slight  complication  involving  twenty-first  birthdays  and 
leap  years,  nearly  led  to  63  extra  years  of  apprenticeship.  The  mistake  was 
quite natural, in a Gilbert-and-Sullivan sort of way; as Ruth later noted, “The
two  words  were  so  much  alike!”.  True,  true;  spoken  language  understanding 
is  a difficult  task,  and  it  is  remarkable  that  humans  do  as  well  at  it  as  we  do. 
The  goal  of  automatic  speech  recognition  (ASR)  research  is  to  address  this 
problem  computationally  by  building  systems  which  map  from  an  acoustic 
signal  to  a string  of  words.  Automatic  speech  understanding  (ASU)  extends 
this  goal  to  producing  some  sort  of  understanding  of  the  sentence,  rather  than 
just  the  words. 

The general problem of automatic transcription of speech by any speaker
in any environment is still far from solved. But recent years have seen ASR
technology mature to the point where it is viable in certain limited domains.
One major application area is in human-computer interaction. While many
tasks are better solved with visual or pointing interfaces, speech has the
potential to be a better interface than the keyboard for tasks where full natural
language communication is useful, or for which keyboards are not
appropriate. This includes hands-busy or eyes-busy applications, such as where
the user has objects to manipulate or equipment to control. Another
important application area is telephony, where speech recognition is already used,
for example, for entering digits, recognizing “yes” to accept collect calls, or
call-routing (“Accounting, please”, “Prof. Landauer, please”). Finally, ASR
is being applied to dictation, i.e., transcription of extended monologue by
a single specific speaker. Dictation is common in fields such as law and is
also important as part of augmentative communication (interaction between
computers and humans with some disability resulting in the inability to type,
or the inability to speak). The blind Milton famously dictated Paradise Lost
to his daughters, and Henry James dictated his later novels after a repetitive
stress injury.

Different applications of speech technology necessarily place different
constraints on the problem and lead to different algorithms. We chose to
focus this chapter on the fundamentals of one crucial area: Large-Vocabulary
Continuous Speech Recognition (LVCSR), with a small section on
acoustic issues in speech synthesis. Large-vocabulary generally means that the
systems have a vocabulary of roughly 5,000 to 60,000 words. The term
continuous means that the words are run together naturally; it contrasts with
isolated-word speech recognition, in which each word must be preceded
and followed by a pause. Furthermore, the algorithms we will discuss are
generally speaker-independent; that is, they are able to recognize speech
from people whose speech the system has never been exposed to before.

The chapter begins with an overview of speech recognition
architecture, and then proceeds to introduce the HMM, the use of the Viterbi and
A* algorithms for decoding, speech acoustics and features, and the use of
Gaussians and MLPs to compute acoustic probabilities. Even relying on the
previous three chapters, summarizing this much of the field in this chapter
requires us to omit many crucial areas; the reader is encouraged to see the
suggested readings at the end of the chapter for useful textbooks and articles.
This chapter also includes a short section on the acoustic component of the
speech synthesis algorithms discussed in Chapter 4.
7.1 Speech Recognition Architecture


Previous chapters have introduced many of the core algorithms used in speech
recognition. Chapter 4 introduced the notions of phone and syllable.
Chapter 5 introduced the noisy channel model, the use of the Bayes rule, and
the probabilistic automaton. Chapter 6 introduced the N-gram language
model and the perplexity metric. In this chapter we introduce the remaining
components of a modern speech recognizer: the Hidden Markov Model
(HMM), the idea of spectral features, the forward-backward algorithm
for HMM training, and the Viterbi and stack decoding (also called A*
decoding) algorithms for solving the decoding problem: mapping from strings
of phone probability vectors to strings of words.

Let’s begin by revisiting the noisy channel model that we saw in
Chapter 5. Speech recognition systems treat the acoustic input as if it were a
‘noisy’ version of the source sentence. In order to ‘decode’ this noisy
sentence, we consider all possible sentences, and for each one we compute
the probability of it generating the noisy sentence. We then choose the
sentence with the maximum probability. Figure 7.1 shows this noisy-channel
metaphor.


[Figure 7.1: a source sentence (“If music be the food of love...”) passes
through a noisy channel to give a noisy sentence; the decoder considers
candidates such as “Alice was beginning to get...”, “Every happy family...”,
“In a hole in the ground...”, “If music be the food of love...”, and “If music
be the foot of dove...”, and outputs its guess at the original sentence
(“If music be the food of love...”).]


Figure 7.1 The noisy channel model applied to entire sentences (Figure 5.1
showed its application to individual words). Modern speech recognizers work
by searching through a huge space of potential ‘source’ sentences and
choosing the one which has the highest probability of generating the ‘noisy’
sentence. To do this they must have models that express the probability of
sentences being realized as certain strings of words (N-grams), models that
express the probability of words being realized as certain strings of phones
(HMMs) and models that express the probability of phones being realized as
acoustic or spectral features (Gaussians/MLPs).


Implementing the noisy-channel model as we have expressed it in
Figure 7.1 requires solutions to two problems. First, in order to pick the sentence
that best matches the noisy input we will need a complete metric for a “best
match”. Because speech is so variable, an acoustic input sentence will never
exactly match any model we have for this sentence. As we have suggested
in previous chapters, we will use probability as our metric, and will show
how to combine the various probabilistic estimators to get a complete
estimate for the probability of a noisy observation-sequence given a candidate
sentence. Second, since the set of all English sentences is huge, we need
an efficient algorithm that will not search through all possible sentences, but
only ones that have a good chance of matching the input. This is the
decoding or search problem, and we will summarize two approaches: the Viterbi
or dynamic programming decoder, and the stack or A* decoder.

In  the  rest  of  this  introduction  we  will  introduce  the  probabilistic  or 
Bayesian  model  for  speech  recognition  (or  more  accurately  re-introduce  it, 
since  we  first  used  the  model  in  our  discussions  of  spelling  and  pronunciation 
in  Chapter  5);  we  leave  discussion  of  decoding/search  for  pages  242-249. 

The  goal  of  the  probabilistic  noisy  channel  architecture  for  speech 
recognition  can  be  summarized  as  follows: 

“What is the most likely sentence out of all sentences in the
language $L$ given some acoustic input $O$?”

We can treat the acoustic input $O$ as a sequence of individual ‘symbols’
or ‘observations’ (for example by slicing up the input every 10 milliseconds,
and representing each slice by floating-point values of the energy or
frequencies of that slice). Each index then represents some time interval, and
successive $o_t$ indicate temporally consecutive slices of the input (note that
capital letters will stand for sequences of symbols and lower-case letters for
individual symbols):

$$O = o_1, o_2, o_3, \ldots, o_t \tag{7.1}$$
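
The slicing described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `frame_energies`, the 8 kHz sample rate, and the synthetic sine-wave input are all hypothetical, the frames do not overlap, and each slice is represented by a single log-energy value rather than by the richer spectral features discussed later in the chapter.

```python
import math

def frame_energies(samples, sample_rate, frame_ms=10):
    """Slice samples into consecutive frames and return one log energy per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        energies.append(math.log(energy + 1e-10))    # small constant avoids log(0)
    return energies

# Hypothetical input: one second of a 100 Hz sine wave sampled at 8000 Hz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(sample_rate)]

O = frame_energies(signal, sample_rate)   # O[0], O[1], ... play the role of o_1, o_2, ...
```

With a 10 millisecond frame, one second of input yields 100 observations, so the index of each observation corresponds directly to a time interval.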

Similarly,  we  will  treat  a sentence  as  if  it  were  composed  simply  of  a 
string  of  words: 

$$W = w_1, w_2, w_3, \ldots, w_n \tag{7.2}$$

Both of these are simplifying assumptions; for example dividing
sentences into words is sometimes too fine a division (we’d like to model facts
about groups of words rather than individual words) and sometimes too gross
a division (we’d like to talk about morphology). Usually in speech
recognition a word is defined by orthography (after mapping every word to
lower-case): oak is treated as a different word than oaks, but the auxiliary can (“can
you tell me...”) is treated as the same word as the noun can (“i need a can
of...”). Recent ASR research has begun to focus on building more
sophisticated models of ASR words incorporating the morphological insights
of Chapter 3 and the part-of-speech information that we will study in
Chapter 8.

The  probabilistic  implementation  of  our  intuition  above,  then,  can  be 
expressed  as  follows: 

$$\hat{W} = \operatorname*{argmax}_{W \in L} P(W|O) \tag{7.3}$$


Recall that the function $\operatorname{argmax}_x f(x)$ means ‘the $x$ such that $f(x)$ is
largest’. Equation (7.3) is guaranteed to give us the optimal sentence $\hat{W}$; we
now need to make the equation operational. That is, for a given sentence $W$
and acoustic sequence $O$ we need to compute $P(W|O)$. Recall that given any
probability $P(x|y)$, we can use Bayes’ rule to break it down as follows:


$$P(x|y) = \frac{P(y|x)P(x)}{P(y)} \tag{7.4}$$


We  saw  in  Chapter  5 that  we  can  substitute  (7.4)  into  (7.3)  as  follows: 


$$\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O|W)P(W)}{P(O)} \tag{7.5}$$


The probabilities on the right hand side of (7.5) are for the most part easier
to compute than $P(W|O)$. For example, $P(W)$, the prior probability of the
word string itself, is exactly what is estimated by the N-gram language
models of Chapter 6. And we will see below that $P(O|W)$ turns out to be easy
to estimate as well. But $P(O)$, the probability of the acoustic observation
sequence, turns out to be harder to estimate. Luckily, we can ignore $P(O)$
just as we saw in Chapter 5. Why? Since we are maximizing over all
possible sentences, we will be computing $\frac{P(O|W)P(W)}{P(O)}$ for each sentence in the
language. But $P(O)$ doesn’t change for each sentence! For each potential
sentence we are still examining the same observations $O$, which must have
the same probability $P(O)$. Thus:

$$\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O|W)P(W)}{P(O)} = \operatorname*{argmax}_{W \in L} P(O|W)P(W) \tag{7.6}$$


To summarize, the most probable sentence $W$ given some observation
sequence $O$ can be computed by taking the product of two probabilities for
each sentence, and choosing the sentence for which this product is greatest.
These two terms have names; $P(W)$, the prior probability, is called the
language model. $P(O|W)$, the observation likelihood, is called the acoustic
model.




Chapter  7.  HMMs  and  Speech  Recognition 


Key Concept #5.

$$\hat{W} = \operatorname*{argmax}_{W \in L} \overbrace{P(O|W)}^{\text{likelihood}} \; \overbrace{P(W)}^{\text{prior}} \tag{7.7}$$
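
A minimal Python sketch of this decision rule, restricted to two candidate sentences, may make the roles of the two terms concrete. All of the probabilities below are invented for illustration, and the names `acoustic`, `prior`, and `decode` are hypothetical; a real recognizer would obtain these values from an acoustic model and an N-gram language model, and could not enumerate every sentence this way. Log probabilities are used because products of many small probabilities underflow quickly.

```python
import math

# Invented numbers for two candidate sentences.
acoustic = {  # P(O|W): how well each sentence explains the acoustic input
    "if music be the food of love": 1e-5,
    "if music be the foot of dove": 2e-5,
}
prior = {     # P(W): how likely each sentence is according to the language model
    "if music be the food of love": 1e-8,
    "if music be the foot of dove": 1e-12,
}

def decode(candidates):
    """Return the candidate W that maximizes log P(O|W) + log P(W)."""
    return max(candidates, key=lambda w: math.log(acoustic[w]) + math.log(prior[w]))

best = decode(list(acoustic))
```

In this toy setting the second sentence matches the acoustics slightly better, but the language model prior outweighs that difference, so the first sentence is chosen.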

We have already seen in Chapter 6 how to compute the language model
prior $P(W)$ by using N-gram grammars. The rest of this chapter will show
how to compute the acoustic model $P(O|W)$, in two steps. First we will
make the simplifying assumption that the input sequence is a sequence of
phones $F$ rather than a sequence of acoustic observations. Recall that we
introduced the forward algorithm in Chapter 5, which was given
‘observations’ that were strings of phones, and produced the probability of these
phone observations given a single word. We will show that these
probabilistic phone automata are really a special case of the Hidden Markov Model,
and we will show how to extend these models to give the probability of a
phone sequence given an entire sentence.

One problem with the forward algorithm as we presented it was that
in order to know which word was the most-likely word (the ‘decoding
problem’), we had to run the forward algorithm again for each word. This is
clearly intractable for sentences; we can’t possibly run the forward
algorithm separately for each possible sentence of English. We will thus
introduce two different algorithms which simultaneously compute the likelihood
of an observation sequence given each sentence, and give us the most-likely
sentence. These are the Viterbi and the A* decoding algorithms.
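
The word-by-word approach can be sketched as follows, to show why it scales poorly: the forward computation is repeated once per lexicon entry. This is a simplified Python illustration, not the Chapter 5 presentation itself. The two-word lexicon, the emission probabilities, and the helper names `forward` and `chain_model` are all hypothetical, and each word model is a strict left-to-right chain with an initial distribution in place of a non-emitting start state.

```python
def forward(obs, states, initial, trans, emit, accepting):
    """Return P(obs | model), summing over every state path ending in an accepting state."""
    # alpha[j]: total probability of all paths that emit the observations so far and end in j
    alpha = {j: initial.get(j, 0.0) * emit[j].get(obs[0], 0.0) for j in states}
    for o in obs[1:]:
        alpha = {
            j: sum(alpha[i] * trans[i].get(j, 0.0) for i in states) * emit[j].get(o, 0.0)
            for j in states
        }
    return sum(alpha[j] for j in accepting)

def chain_model(phones, emit):
    """Build a left-to-right word model with one state per phone (no skips or loops)."""
    states = [f"{p}{k}" for k, p in enumerate(phones)]
    trans = {s: {} for s in states}
    for a, b in zip(states, states[1:]):
        trans[a][b] = 1.0
    return dict(states=states, initial={states[0]: 1.0}, trans=trans,
                emit={s: emit[p] for s, p in zip(states, phones)},
                accepting=[states[-1]])

# Invented emission probabilities: [d] and [t] are sometimes confused.
EMIT = {"n": {"n": 1.0}, "iy": {"iy": 1.0},
        "d": {"d": 0.8, "t": 0.2}, "t": {"t": 0.8, "d": 0.2}}
LEXICON = {"need": ["n", "iy", "d"], "neat": ["n", "iy", "t"]}

obs = ["n", "iy", "d"]
# One complete forward pass per word: fine for two words, hopeless for all sentences.
scores = {w: forward(obs, **chain_model(p, EMIT)) for w, p in LEXICON.items()}
best = max(scores, key=scores.get)
```

The cost grows linearly with the number of candidates, which is acceptable for a small word list but not for the set of all possible sentences.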

Once we have solved the likelihood-computation and decoding
problems for a simplified input consisting of strings of phones, we will show
how the same algorithms can be applied to true acoustic input rather than
pre-defined phones. This will involve a quick introduction to acoustic input
and feature extraction, the process of deriving meaningful features from
the input soundwave. Then we will introduce the two standard models for
computing phone-probabilities from these features: Gaussian models, and
neural net (multi-layer perceptron) models.

Finally, we will introduce the standard algorithm for training the
Hidden Markov Models and the phone-probability estimators, the
forward-backward or Baum-Welch algorithm (Baum, 1972), a special case of the
Expectation-Maximization or EM algorithm (Dempster et al., 1977).

As a preview of the chapter, Figure 7.2 shows an outline of the
components of a speech recognition system. The figure shows a speech recognition
system broken down into three stages. In the signal processing or feature
extraction stage, the acoustic waveform is sliced up into frames (usually
of 10, 15, or 20 milliseconds) which are transformed into spectral features
which give information about how much energy in the signal is at different
frequencies. In the subword or phone recognition stage, we use statistical
techniques like neural networks or Gaussian models to tentatively recognize
individual speech sounds like p or b. For a neural network, the output of
this stage is a vector of probabilities over phones for each frame (i.e., ‘for
this frame the probability of [p] is .8, the probability of [b] is .1, the
probability of [f] is .02, etc.’); for a Gaussian model the probabilities are slightly
different. Finally, in the decoding stage, we take a dictionary of word
pronunciations and a language model (probabilistic grammar) and use a Viterbi
or A* decoder to find the sequence of words which has the highest
probability given the acoustic events.


7.2  Overview  of  Hidden  Markov  Models 

In Chapter 5 we used weighted finite-state automata or Markov chains to
model the pronunciation of words. The automata consisted of a sequence
of states $q = (q_0 q_1 q_2 \ldots q_n)$, each corresponding to a phone, and a set of
transition probabilities between states, $a_{01}, a_{12}, a_{13}$, encoding the probability
of one phone following another. We represented the states as nodes, and
the transition probabilities as edges between nodes; an edge existed between
two nodes if there was a non-zero transition probability between the two
nodes. We also saw that we could use the forward algorithm to compute the
likelihood of a sequence of observed phones $o = (o_1 o_2 o_3 \ldots o_t)$. Figure 7.3
shows an automaton for the word need with sample observation sequence of
the kind we saw in Chapter 5.


[Figure 7.3: word model for need, with transition probability $a_{24} = .11$,
shown above an observation sequence of phone symbols $o_1\ o_2\ o_3$.]


Figure  7.3  A simple  weighted  automaton  or  Markov  chain  pronunciation 
network  for  the  word  need,  showing  the  transition  probabilities,  and  a sample 
observation  sequence.  The  transition  probabilities  between  two  states  x 
and  y are  1.0  unless  otherwise  specified. 


While we will see that these models figure importantly in speech
recognition, they simplify the problem in two ways. First, they assume that the
input consists of a sequence of symbols! Obviously this is not true in the
real world, where speech input consists essentially of small movements of
air particles. In speech recognition, the input is an ambiguous, real-valued
representation of the sliced-up input signal, called features or spectral
features. We will study the details of some of these features beginning on
page 258; acoustic features represent such information as how much energy
there is at different frequencies. The second simplifying assumption of the
weighted automata of Chapter 5 was that the input symbols correspond
exactly to the states of the machine. Thus when seeing an input symbol [b],
we knew that we could move into a state labeled [b]. In a Hidden Markov
Model, by contrast, we can’t look at the input symbols and know which state
to move to. The input symbols don’t uniquely determine the next state.1

Recall that a weighted automaton or simple Markov model is specified
by the set of states $Q$, the set of transition probabilities $A$, a defined start
state and end state(s), and a set of observation likelihoods $B$. For weighted
automata, we defined the probabilities $b_j(o_t)$ as 1.0 if the state $j$ matched the
observation $o_t$ and 0 if they didn’t match. An HMM formally differs from a
Markov model by adding two more requirements. First, it has a separate set
of observation symbols $O$, which is not drawn from the same alphabet as the
state set $Q$. Second, the observation likelihood function $B$ is not limited to
the values 1.0 and 0; in an HMM the probability $b_j(o_t)$ can take on any value
from 0 to 1.0.

1 Actually, as we mentioned in passing, by this second criterion some of the automata we
saw in Chapter 5 were technically HMMs as well. This is because the first symbol in the
input string [n iy] was compatible with the [n] states in the words need or an. Seeing the
symbol [n], we didn’t know which underlying state it was generated by, need-n or an-n.


[Figure 7.4: word model for need, including the transition probability $a_{24}$,
shown above an observation sequence of spectral feature vectors
$o_1\ o_2\ o_3\ o_4\ o_5\ o_6$.]


Figure 7.4 An HMM pronunciation network for the word need, showing
the transition probabilities, and a sample observation sequence. Note the
addition of the output probabilities $B$. HMMs used in speech recognition usually
use self-loops on the states to model variable phone durations.


Figure 7.4 shows an HMM for the word need and a sample
observation sequence. Note the differences from Figure 7.3. First, the observation
sequences are now vectors of spectral features representing the speech
signal. Next, note that we’ve also allowed one state to generate multiple copies
of the same observation, by having a loop on the state. This loop allows
HMMs to model the variable duration of phones; longer phones require more
loops through the HMM.

In  summary,  here  are  the  parameters  we  need  to  define  an  HMM: 

• states: A set of states $Q = q_1 q_2 \ldots q_N$.

• transition probabilities: A set of probabilities $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$.
Each $a_{ij}$ represents the probability of transitioning from state $i$ to state
$j$. The set of these is the transition probability matrix.

• observation likelihoods: A set of observation likelihoods $B = b_i(o_t)$,
each expressing the probability of an observation $o_t$ being generated
from a state $i$.

In  our  examples  so  far  we  have  used  two  ‘special’  states  (non-emitting 
states)  as  the  start  and  end  state;  as  we  saw  in  Chapter  5 it  is  also  possible  to 
avoid  the  use  of  these  states  by  specifying  two  more  things: 

• initial distribution: An initial probability distribution over states, $\pi$,
such that $\pi_i$ is the probability that the HMM will start in state $i$. Of
course some states $j$ may have $\pi_j = 0$, meaning that they cannot be
initial states.

• accepting  states:  A set  of  legal  accepting  states. 

As was true for the weighted automata, the sequences of symbols that
are input to the model (if we are thinking of it as a recognizer) or which are
produced by the model (if we are thinking of it as a generator) are generally
called the observation sequence, referred to as $O = (o_1 o_2 o_3 \ldots o_T)$.
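
The parameters listed above map naturally onto a small data structure. The Python sketch below is one possible representation, assuming a discrete observation alphabet and the initial-distribution variant without special start and end states. The class name `HMM`, its field names, the observation symbols `v1` to `v3`, and every probability are hypothetical; in a speech recognizer the observations are real-valued feature vectors and $B$ is computed by Gaussians or a neural network rather than looked up in a table.

```python
from dataclasses import dataclass, field

@dataclass
class HMM:
    states: list                 # Q: the state names
    trans: dict                  # A: trans[i][j] = a_ij
    emit: dict                   # B: emit[i][o] = b_i(o), discrete observations only
    initial: dict                # pi: initial[i] = probability of starting in state i
    accepting: set = field(default_factory=set)   # legal final states

    def is_normalized(self, tol=1e-9):
        """Check that pi, every row of A, and every row of B each sum to 1."""
        rows = [self.initial]
        rows += [self.trans[s] for s in self.states]
        rows += [self.emit[s] for s in self.states]
        return all(abs(sum(r.values()) - 1.0) <= tol for r in rows)

# Toy model for the word "need"; self-loops allow variable phone durations.
need = HMM(
    states=["n", "iy", "d"],
    trans={"n": {"n": 0.3, "iy": 0.7},
           "iy": {"iy": 0.5, "d": 0.5},
           "d": {"d": 1.0}},
    emit={"n": {"v1": 0.6, "v2": 0.4},
          "iy": {"v2": 0.7, "v3": 0.3},
          "d": {"v1": 0.2, "v3": 0.8}},
    initial={"n": 1.0},          # states missing from pi have initial probability 0
    accepting={"d"},
)
```

Storing only the non-zero entries keeps the left-to-right structure of a word model visible: any transition or initial probability that is absent is treated as zero.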


7.3  The  Viterbi  Algorithm  Revisited 

Chapter 5 showed how the forward algorithm could be used to compute the
probability of an observation sequence given an automaton, and how the
Viterbi algorithm can be used to find the most-likely path through the
automaton, as well as the probability of the observation sequence given this
most-likely path. In Chapter 5 the observation sequences consisted of a
single word. But in continuous speech, the input consists of sequences of
words, and we are not given the location of the word boundaries. Knowing
where the word boundaries are massively simplifies the problem of
pronunciation; in Chapter 5, since we were sure that the pronunciation [ni] came
from one word, we only had 7 candidates to compare. But in actual speech
we don’t know where the word boundaries are. For example, try to decode
the following sentence from Switchboard (don’t peek ahead!):

[ay d ih s hh er d s ah m th ih ng ax b aw m uh v ih ng r ih s en l ih]

The answer is in the footnote.2 The task is hard partly because of
coarticulation and fast speech (e.g. [d] for the first phone of just!). But mainly
it’s the lack of spaces indicating word boundaries that makes the task difficult.
The task of finding word boundaries in connected speech is called
segmentation and we will solve it by using the Viterbi algorithm just as we did for
Chinese word-segmentation in Chapter 5. Recall that the algorithm for
Chinese word-segmentation relied on choosing the segmentation that resulted
in the sequence of words with the highest frequency. For speech
segmentation we use the more sophisticated N-gram language models introduced in
Chapter 6. In the rest of this section we show how the Viterbi algorithm can
be applied to the task of decoding and segmentation of a simple string of
observed phones, using an N-gram language model. We will show how
the algorithm is used to segment a very simple string of words. Here’s the
input and output we will work with:

Input  Output 

[aa  n iy  dh  ax]  I need  the 

Figure  7.5  shows  word  models  for  I,  need,  the,  and  also,  just  to  make 
things  difficult,  the  word  on. 


Recall that the goal of the Viterbi algorithm is to find the best state
sequence $q = (q_1 q_2 q_3 \ldots q_t)$ given the set of observed phones
$o = (o_1 o_2 o_3 \ldots o_t)$. A graphic illustration of the output of the dynamic
programming algorithm is shown in Figure 7.6. Along the y-axis are all the
words in the lexicon; inside each word are its states. The x-axis is ordered
by time, with one observed phone per time unit.3 Each cell in the matrix will
contain the probability of the most-likely sequence ending at that state. We
can find the most-likely state sequence for the entire observation string by
looking at the cell in the right-most column that has the highest probability,
and tracing back the sequence that produced it.

3 This x-axis component of the model is simplified in two major ways that we will show
how to fix in the next section. First, the observations will not be phones but extracted spectral
features, and second, each phone consists not of a single time unit observation but of many
observations (since phones can last for more than one time unit). The y-axis is also simplified
in this example, since as we will see most ASR systems use multiple ‘subphone’ units for each
phone.


Figure  7.6  An  illustration  of  the  results  of  the  Viterbi  algorithm  used  to 
find  the  most-likely  phone  sequence  (and  hence  estimate  the  most-likely  word 
sequence). 




More formally, we are searching for the best state sequence $q^* = (q_1 q_2 \ldots q_T)$,
given an observation sequence $o = (o_1 o_2 \ldots o_T)$ and a model (a weighted
automaton or ‘state graph’) $\lambda$. Each cell $viterbi[t,i]$ of the matrix contains the
probability of the best path which accounts for the first $t$ observations and
ends in state $i$ of the HMM. This is the most-probable path out of all possible
sequences of states of length $t-1$:

$$viterbi[t,i] = \max_{q_1,q_2,\ldots,q_{t-1}} P(q_1 q_2 \ldots q_{t-1}, q_t = i, o_1, o_2 \ldots o_t \mid \lambda) \tag{7.8}$$

In order to compute $viterbi[t,i]$, the Viterbi algorithm assumes the
dynamic programming invariant. This is the simplifying (but incorrect)
assumption that if the ultimate best path for the entire observation sequence
happens to go through a state $q_i$, then this best path must include the best
path up to and including state $q_i$. This doesn’t mean that the best path at any
time $t$ is the best path for the whole sequence. A path can look bad at the
beginning but turn out to be the best path. As we will see later, the Viterbi
assumption breaks down for certain kinds of grammars (including trigram
grammars) and so some recognizers have moved to another kind of decoder,
the stack or A* decoder; more on that later. As we saw in our discussion
of the minimum-edit-distance algorithm in Chapter 5, the reason for making
the Viterbi assumption is that it allows us to break down the computation of
the optimal path probability in a simple way; each of the best paths at time $t$
is the best extension of each of the paths ending at time $t-1$. In other words,
the recurrence relation for the best path at time $t$ ending in state $j$, $viterbi[t,j]$,
is the maximum of the possible extensions of every possible previous path
from time $t-1$ to time $t$:

$$\mathit{viterbi}[t,j] = \max_i \left(\mathit{viterbi}[t-1,i]\; a_{ij}\right) b_j(o_t) \qquad (7.9)$$
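
As a minimal sketch of the recurrence in (7.9), the following Python function fills in one column of the matrix from the previous column. The function name `viterbi_step`, the dictionary-based representation, and the toy numbers are assumptions made for illustration; they are not taken from the figures in this chapter.

```python
def viterbi_step(prev, trans, emit, obs):
    """Compute one Viterbi column from the previous one (Equation 7.9).

    prev  -- dict: state i -> viterbi[t-1, i]
    trans -- dict: (i, j) -> transition probability a_ij
    emit  -- dict: (j, observation) -> observation likelihood b_j(o_t)
    obs   -- the observation o_t
    Returns (column, backpointers) for time t.
    """
    column, back = {}, {}
    states = {j for (_, j) in trans}
    for j in states:
        # max over all predecessor states i of viterbi[t-1, i] * a_ij
        best_i, best = None, 0.0
        for i, p in prev.items():
            score = p * trans.get((i, j), 0.0)
            if score > best:
                best_i, best = i, score
        score = best * emit.get((j, obs), 0.0)
        if score > 0.0:  # zero-probability cells are simply left out
            column[j] = score
            back[j] = best_i
    return column, back


# Toy example with made-up probabilities.
prev = {"A": 0.6, "B": 0.4}
trans = {("A", "C"): 0.5, ("B", "C"): 0.9, ("A", "D"): 0.5, ("B", "D"): 0.1}
emit = {("C", "x"): 1.0, ("D", "x"): 0.5}
column, back = viterbi_step(prev, trans, emit, "x")
```

Here state C is best reached from B (0.4 × 0.9) even though A had the higher probability in the previous column, which is why each cell also stores a backpointer.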

The algorithm as we describe it in Figure 7.9 takes a sequence of
observations, and a single probabilistic automaton, and returns the optimal path
through the automaton. Since the algorithm requires a single automaton, we
will need to combine the different probabilistic phone networks for the, on,
need, and I into one automaton. In order to build this new automaton we
will need to add arcs with probabilities between any two words: bigram
probabilities. Figure 7.7 shows simple bigram probabilities computed from
the combined Brown and Switchboard corpora.


Bigram        Probability
I need        0.0016
need need     0.000047
# Need        0.000018
need on       0.000047
I I           ...
need I        0.000016
...           0.0031
the I         0.00051
on I          0.00085

Figure 7.7 Bigram probabilities for the words the, on, need, and I following
each other, and starting a sentence (i.e. following #). Computed from the
combined Brown and Switchboard corpora with add-0.5 smoothing.
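
The caption mentions add-0.5 smoothing. As a hedged sketch of what such an estimate might look like, the function below adds a constant k = 0.5 to every bigram count before normalizing. The function name, the tiny vocabulary, and the counts are all hypothetical; the exact counts and normalization used for Figure 7.7 are not given in the text.

```python
def add_k_bigram_prob(bigram_counts, vocab, w1, w2, k=0.5):
    """P(w2 | w1) with add-k smoothing (k = 0.5 gives add-0.5 smoothing)."""
    # Number of times w1 was seen as the first word of any bigram.
    context_count = sum(bigram_counts.get((w1, w), 0) for w in vocab)
    pair_count = bigram_counts.get((w1, w2), 0)
    return (pair_count + k) / (context_count + k * len(vocab))


# Hypothetical counts, for illustration only.
vocab = ["I", "need", "the", "on"]
bigram_counts = {("I", "need"): 3, ("I", "I"): 1}

p_seen = add_k_bigram_prob(bigram_counts, vocab, "I", "need")   # (3 + 0.5) / (4 + 2)
p_unseen = add_k_bigram_prob(bigram_counts, vocab, "I", "on")   # (0 + 0.5) / (4 + 2)
total = sum(add_k_bigram_prob(bigram_counts, vocab, "I", w) for w in vocab)
```

An unseen bigram such as I on still receives a small non-zero probability, and the smoothed probabilities for a given first word still sum to one.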

Figure  7.8  shows  the  combined  pronunciation  networks  for  the  4 words 
together  with  a few  of  the  new  arcs  with  the  bigram  probabilities.  For  read- 
ability of  the  diagram,  most  of  the  arcs  aren't  shown;  the  reader  should  imag- 
ine that  each  probability  in  Figure  7.7  is  inserted  as  an  arc  between  every  two 
words. 

The algorithm is given in Figure 5.19 in Chapter 5, and is repeated
here for convenience as Figure 7.9. We see in Figure 7.9 that the Viterbi
algorithm sets up a probability matrix, with one column for each time index
t and one row for each state in the state graph. The algorithm first creates
T + 2 columns; Figure 7.10 shows the first 6 columns. The first column is
an initial pseudo-observation, the next corresponds to the first observation
phone [aa], and so on. We begin in the first column by setting the probability
of the start state to 1.0, and the other probabilities to 0; the reader should
find this in Figure 7.10. Cells with probability 0 are simply left blank for
readability. For each column of the matrix, i.e. for each time index t, each
cell viterbi[t,j] will contain the probability of the most likely path to end in
that cell. We will calculate this probability recursively, by maximizing over
the probability of coming from all possible preceding states. Then we move
to the next state; for each of the i states viterbi[0,i] in column 0, we compute
the probability of moving into each of the j states viterbi[1,j] in column 1,
according to the recurrence relation in (7.9). In the column for the input aa,
only two cells have non-zero entries, since $b_j(aa)$ is zero for every state
except the two states labeled aa. The value of viterbi[1,aa] of the word
I is the product of the transition probability from # to I and the probability of
I being pronounced with the vowel aa.

Notice that if we look at the column for the observation n, the word
on is currently the ‘most-probable’ word. But since there is no word or set of
words in this lexicon which is pronounced iy dh ax, the path starting with on
is a dead end, i.e. this hypothesis can never be extended to cover the whole
function VITERBI(observations of len T, state-graph) returns best-path

  num-states ← NUM-OF-STATES(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] ← 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s specified by state-graph
        new-score ← viterbi[s,t] * a[s,s'] * b_{s'}(o_t)
        if ((viterbi[s',t+1] = 0) || (new-score > viterbi[s',t+1]))
        then
          viterbi[s',t+1] ← new-score
          back-pointer[s',t+1] ← s

  Backtrace from highest probability state in the final column of viterbi[] and
  return path


Figure 7.9 Viterbi algorithm for finding optimal sequence of states in
continuous speech recognition, simplified by using phones as inputs (duplicate of
Figure 5.19). Given an observation sequence of phones and a weighted
automaton (state graph), the algorithm returns the path through the automaton
which has maximum probability and accepts the observation sequence. a[s,s']
is the transition probability from current state s to next state s' and b_{s'}(o_t) is
the observation likelihood of o_t given s'.


utterance. 

By the time we see the observation iy, there are two competing paths:
I need and I the; I need is currently more likely. When we get to the
observation dh, we could have arrived from either the iy of need or the iy of the.
The probability of the max of these two paths, in this case the path through I
need, will go into the cell for dh.

Finally, the probability for the best path will appear in the final ax
column. In this example, only one cell is non-zero in this column; the ax
state of the word the (a real example wouldn’t be this simple; many other
cells would be non-zero).

If  the  sentence  had  actually  ended  here,  we  would  now  need  to  back- 
trace to  find  the  path  that  gave  us  this  probability.  We  can’t  just  pick  the 
highest  probability  state  for  each  state  column.  Why  not?  Because  the  most 
likely  path  early  on  is  not  necessarily  the  most  likely  path  for  the  whole  sen- 
tence. Recall  that  the  most  likely  path  after  seeing  n was  the  word  on.  But 
the  most  likely  path  for  the  whole  sentence  is  I need  the.  Thus  we  had  to 

Figure 7.10 The entries in the individual state columns (#, aa, n, iy, dh, ax)
for the Viterbi algorithm. Each cell keeps the probability of the best path so
far and a pointer to the previous cell along that path. Backtracing from the
successful last word (the), we can reconstruct the word sequence I need the.


rely  in  Figure  7.10  on  the  ‘Hansel  and  Gretel’  method  (or  the  ‘Jason  and 
the  Minotaur’  method  if  you  like  your  metaphors  more  classical):  whenever 
we  moved  into  a cell,  we  kept  pointers  back  to  the  cell  we  came  from.  The 
reader  should  convince  themselves  that  the  Viterbi  algorithm  has  simultane- 
ously solved  the  segmentation  and  decoding  problems. 
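
The walk-through above can be reproduced with a short Python sketch. This is an illustration under simplifying assumptions, not the implementation behind Figure 7.10: each word has a single made-up pronunciation (need is reduced to n iy), each state emits exactly one phone with likelihood 1.0, and the start and word-to-word probabilities are invented rather than taken from Figure 7.7.

```python
def viterbi(observations, start, trans, emit_label):
    """Return (best state path, its probability) through a weighted automaton.

    trans      -- dict: state -> list of (next_state, transition probability)
    emit_label -- dict: state -> the one phone that state emits
    """
    columns = [{start: 1.0}]      # column 0 is the pseudo-observation '#'
    backptrs = [{}]
    for obs in observations:
        col, back = {}, {}
        for s, p in columns[-1].items():
            for s2, a in trans.get(s, []):
                if emit_label[s2] != obs:   # observation likelihood is 0
                    continue
                score = p * a               # observation likelihood is 1
                if score > col.get(s2, 0.0):
                    col[s2], back[s2] = score, s
        columns.append(col)
        backptrs.append(back)
    if not columns[-1]:
        return [], 0.0
    # Backtrace from the highest-probability state in the final column.
    state = max(columns[-1], key=columns[-1].get)
    best_prob = columns[-1][state]
    path = [state]
    for t in range(len(observations), 0, -1):
        state = backptrs[t][state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Hypothetical lexicon and probabilities, for illustration only.
lexicon = {"I": ["aa"], "on": ["aa", "n"], "need": ["n", "iy"], "the": ["dh", "ax"]}
start_prob = {"I": 0.4, "on": 0.4, "need": 0.1, "the": 0.1}
word_to_word = 0.25               # made-up uniform bigram probability

trans, emit_label = {"#": []}, {}
for word, phones in lexicon.items():
    for k, phone in enumerate(phones):
        emit_label[(word, k)] = phone
        if k + 1 < len(phones):
            trans[(word, k)] = [((word, k + 1), 1.0)]
    trans["#"].append(((word, 0), start_prob[word]))
for word, phones in lexicon.items():
    trans[(word, len(phones) - 1)] = [((w2, 0), word_to_word) for w2 in lexicon]

path, prob = viterbi(["aa", "n", "iy", "dh", "ax"], "#", trans, emit_label)
words = [w for (w, k) in path[1:] if k == 0]
```

With these numbers the cell for on holds the highest probability after the observation n, yet the backtrace from the final column recovers I need the, since the on hypothesis cannot be extended.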

The presentation of the Viterbi algorithm in this section has been
simplified; actual implementations of Viterbi decoding are more complex in
three key ways that we have mentioned already. First, in an actual HMM
for speech recognition, the input would not be phones. Instead, the input
is a feature vector of spectral and acoustic features. Thus the observation
likelihood probabilities $b_j(o_t)$ of an observation $o_t$ given a state j will not
simply take on the values 0 or 1, but will be more fine-grained probability
estimates, computed via mixtures of Gaussian probability estimators or
neural nets. The next section will show how these probabilities are computed.

Second, the HMM states in most speech recognition systems are not
simple phones but rather subphones. In these systems each phone is divided
into 3 states: the beginning, middle and final portions of the phone. Dividing
up a phone in this way captures the intuition that the significant changes in
the acoustic input happen at a finer granularity than the phone; for
example the closure and release of a stop consonant. Furthermore, many systems
use a separate instance of each of these subphones for each triphone
context (Schwartz et al., 1985; Deng et al., 1990). Thus instead of around 60
phone units, there could be as many as $60^3$ context-dependent triphones. In
practice, many possible sequences of phones never occur or are very rare,
so systems create a much smaller number of triphone models by clustering
the possible triphones (Young and Woodland, 1994). Figure 7.11 shows an
example of the complete phone model for the triphone b(ax,aw).
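
To make the notation concrete, here is a small Python sketch that relabels a phone sequence as context-dependent triphones and expands each one into three subphone states. The helper names, the `#` boundary symbol, and the state-name suffixes are assumptions for illustration only.

```python
def to_triphones(phones, boundary="#"):
    """Label each phone with its left and right neighbors, e.g. b(ax,aw)."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i]}({padded[i - 1]},{padded[i + 1]})"
            for i in range(1, len(padded) - 1)]


def subphone_states(triphone):
    """Split one triphone into its left, middle, and right subphone states."""
    return [f"{triphone}_{part}" for part in ("left", "middle", "right")]


about = ["ax", "b", "aw", "t"]          # an illustrative pronunciation of "about"
triphones = to_triphones(about)
states = subphone_states(triphones[1])  # the three states of b(ax,aw)

# Upper bound on distinct triphones for an inventory of about 60 phones.
max_triphones = 60 ** 3                 # 216000
```

The upper bound of 216,000 units is what motivates the clustering of triphones mentioned above.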


Figure 7.11 An example of the context-dependent triphone b(ax,aw) (the
phone [b] preceded by a [ax] and followed by a [aw], as in the beginning of
about), showing its left, middle, and right subphones.

Finally, in practice in large-vocabulary recognition it is too expensive
to consider all possible words when the algorithm is extending paths from
one state-column to the next. Instead, low-probability paths are pruned at
each time step and not extended to the next state column. This is usually
implemented via beam search: for each state column (time step), the algorithm
maintains a short list of high-probability words whose path probabilities are
within some percentage (beam width) of the most probable word path. Only
transitions from these words are extended when moving to the next time step.
Since the words are ranked by the probability of the path so far, which words
are within the beam (active) will change from time step to time step. Making
this beam search approximation allows a significant speed-up at the cost of
a degradation to the decoding performance. This beam search strategy was
first implemented by Lowerre (1968). Because in practice most
implementations of Viterbi use beam search, some of the literature uses the term beam
search or time-synchronous beam search instead of Viterbi.
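
The pruning step can be sketched in a few lines of Python. The function name, the representation of a state column as a dictionary, and the example probabilities are assumptions; here the beam width is expressed as a fraction of the best path probability in the column.

```python
def prune_to_beam(column, beam_width):
    """Keep only hypotheses within the beam of the best one.

    column     -- dict: hypothesis -> path probability so far
    beam_width -- fraction in (0, 1]; a hypothesis stays active if its
                  probability is at least beam_width * (best probability)
    """
    if not column:
        return {}
    best = max(column.values())
    return {h: p for h, p in column.items() if p >= beam_width * best}


# Made-up path probabilities for one time step.
column = {"on": 0.4, "I need": 0.1, "the": 0.001}
active = prune_to_beam(column, beam_width=0.1)
```

Only the hypotheses in `active` would be extended to the next column. Because pruning is relative to the current best path, a hypothesis that is eventually correct can be lost if it falls outside the beam early on, which is the degradation in decoding performance mentioned above.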
7.4 Advanced Methods for Decoding


There are two main limitations of the Viterbi decoder. First, the Viterbi
decoder does not actually compute the sequence of words which is most
probable given the input acoustics. Instead, it computes an approximation to
this: the sequence of states (i.e. phones or subphones) which is most
probable given the input. This difference may not always be important; the most
probable  sequence  of  phones  may  very  well  correspond  exactly  to  the  most 
probable  sequence  of  words.  But  sometimes  the  most  probable  sequence 
of  phones  does  not  correspond  to  the  most  probable  word  sequence.  For 
example  consider  a speech  recognition  system  whose  lexicon  has  multiple 
pronunciations  for  each  word.  Suppose  the  correct  word  sequence  includes 
a word  with  very  many  pronunciations.  Since  the  probabilities  leaving  the 
start  arc  of  each  word  must  sum  to  1.0,  each  of  these  pronunciation-paths 
through  this  multiple-pronunciation  HMM  word  model  will  have  a smaller 
probability  than  the  path  through  a word  with  only  a single  pronunciation 
path.  Thus  because  the  Viterbi  decoder  can  only  follow  one  of  these  pronun- 
ciation paths,  it  may  ignore  this  word  in  favor  of  an  incorrect  word  with  only 
one  pronunciation  path. 

A second problem with the Viterbi decoder is that it cannot be used
with all possible language models. In fact, the Viterbi algorithm as we have
defined it cannot take complete advantage of any language model more
complex than a bigram grammar. This is because of the fact mentioned earlier that
a trigram grammar, for example, violates the dynamic programming
invariant that makes dynamic programming algorithms possible. Recall that
this invariant is the simplifying (but incorrect) assumption that if the ultimate
best path for the entire observation sequence happens to go through a state
$q_i$, then this best path must include the best path up to and including state
$q_i$. Since a trigram grammar allows the probability of a word to be based on
the two previous words, it is possible that the best trigram-probability path
for the sentence may go through a word but not include the best path to that
word. Such a situation could occur if a particular word $w_x$ has a high
trigram probability given $w_y, w_z$, but conversely the best path to $w_y$ didn’t
include $w_z$ (i.e. $P(w_y \mid w_q, w_z)$ was low for all q).

There are two classes of solutions to these problems with Viterbi
decoding. One class involves modifying the Viterbi decoder to return
multiple potential utterances and then using other high-level language model
or pronunciation-modeling algorithms to re-rank these multiple outputs. In
general this kind of multiple-pass decoding allows a computationally
efficient, but perhaps unsophisticated, language model like a bigram to perform
a rough first decoding pass, allowing more sophisticated but slower decoding
algorithms to run on a reduced search space.

For example, Schwartz and Chow (1990) give a Viterbi-like algorithm
which returns the N-best sentences (word sequences) for a given speech
input. Suppose for example a bigram grammar is used with this N-best-Viterbi
to return the 10,000 most highly-probable sentences, each with its
likelihood score. A trigram-grammar can then be used to assign a new
language-model prior probability to each of these sentences. These priors can be
combined with the acoustic likelihood of each sentence to generate a posterior
probability for each sentence. Sentences can then be rescored using this
more sophisticated probability. Figure 7.12 shows an intuition for this
algorithm.
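
A minimal sketch of the rescoring pass, assuming the first pass has already produced an N-best list with an acoustic log likelihood for each sentence. The function name and all scores below are hypothetical, and a simple lookup table stands in for a real trigram model. Working with log probabilities turns the product of prior and likelihood into a sum; the result is proportional to the posterior rather than a normalized probability.

```python
def rescore_nbest(nbest, lm_logprob):
    """Re-rank an N-best list using a second language model.

    nbest      -- list of (sentence, acoustic log likelihood) pairs
    lm_logprob -- function: sentence -> log prior under the second model
    Returns (sentence, combined log score) pairs, best first.
    """
    rescored = [(s, acoustic + lm_logprob(s)) for s, acoustic in nbest]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical first-pass output: the top entry is acoustically best.
nbest = [
    ("if music be the foot of dove", -10.0),
    ("if music be the food of love", -10.5),
    ("if muscle be the food of love", -11.0),
]
# Stand-in for a trigram language model (made-up log probabilities).
trigram_logprob = {
    "if music be the foot of dove": -9.0,
    "if music be the food of love": -5.0,
    "if muscle be the food of love": -8.0,
}
rescored = rescore_nbest(nbest, trigram_logprob.get)
best_sentence = rescored[0][0]
```

The second pass only has to score the sentences on the list, so the more expensive model never sees the full search space.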


[Figure 7.12 diagram: speech input → N-best decoder (simple knowledge source) → N-best list → rescoring (smarter knowledge source) → 1-best utterance]

Figure 7.12 The use of N-best decoding as part of a two-stage decoding
model. Efficient but unsophisticated knowledge sources are used to return the
N-best utterances. This significantly reduces the search space for the second
pass models, which are thus free to be very sophisticated but slow.

An augmentation of N-best, still part of this first class of extensions to
Viterbi, is to return, not a list of sentences, but a word lattice. A word lattice
is a directed graph of words and links between them which can compactly
encode a large number of possible sentences. Each word in the lattice is
augmented with its observation likelihood, so that any particular path through
the lattice can then be combined with the prior probability derived from a
more sophisticated language model. For example Murveit et al. (1993)
describe an algorithm used in the SRI recognizer Decipher which uses a bigram
grammar in a rough first pass, producing a word lattice which is then refined
by a more sophisticated language model.
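
The following Python sketch shows one way a word lattice might be represented and rescored. The adjacency-list layout, the acoustic log likelihoods, and the penalty table standing in for a better language model are all assumptions made for illustration. The sketch simply enumerates every path, which is only feasible for a tiny lattice; a real second pass would search the lattice rather than list all of its sentences.

```python
def lattice_paths(edges, start, end):
    """Yield (words, acoustic log likelihood) for every start-to-end path.

    edges -- dict: node -> list of (next_node, word, acoustic log likelihood)
    """
    stack = [(start, [], 0.0)]
    while stack:
        node, words, score = stack.pop()
        if node == end:
            yield words, score
            continue
        for nxt, word, logp in edges.get(node, []):
            stack.append((nxt, words + [word], score + logp))


# A made-up lattice: 10 word arcs encode 2 * 2 * 2 = 8 sentences.
lattice = {
    0: [(1, "if", -1.0)],
    1: [(2, "music", -2.0), (2, "muscle", -2.5)],
    2: [(3, "be", -1.0)],
    3: [(4, "the", -1.0)],
    4: [(5, "food", -2.2), (5, "foot", -2.0)],
    5: [(6, "of", -1.0)],
    6: [(7, "love", -2.0), (7, "dove", -2.1)],
}
# Stand-in for a more sophisticated language model (hypothetical values).
lm_penalty = {"muscle": -3.0, "foot": -2.0, "dove": -4.0}


def lm_logprob(words):
    return sum(lm_penalty.get(w, -1.0) for w in words)


paths = list(lattice_paths(lattice, 0, 7))
acoustic_best = max(paths, key=lambda p: p[1])[0]
combined_best = max(paths, key=lambda p: p[1] + lm_logprob(p[0]))[0]
```

With these numbers the acoustically best path contains foot, while combining each path with the language model score selects if music be the food of love.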

The second solution to the problems with Viterbi decoding is to employ
a completely different decoding algorithm. The most common alternative
algorithm is the stack decoder, also called the A* decoder (Jelinek, 1969;
Jelinek et al., 1975). We will describe the algorithm in terms of the A*
search used in the artificial intelligence literature, although the development
of stack decoding actually came from the communications theory literature
and the link with AI best-first search was noticed only later (Jelinek, 1976).




A*  Decoding 

To  see  how  the  A*  decoding  method  works,  we  need  to  revisit  the  Viterbi  al- 
gorithm. Recall  that  the  Viterbi  algorithm  computed  an  approximation  of  the 
forward  algorithm.  Viterbi  computes  the  observation  likelihood  of  the  single 
best  (MAX)  path  through  the  HMM,  while  the  forward  algorithm  computes 
the  observation  likelihood  of  the  total  (SUM)  of  all  the  paths  through  the 
HMM.  But  we  accepted  this  approximation  because  Viterbi  computed  this 
likelihood  and  searched  for  the  optimal  path  simultaneously.  The  A*  decod- 
ing algorithm,  on  the  other  hand,  will  rely  on  the  complete  forward  algorithm 
rather  than  an  approximation.  This  will  ensure  that  we  compute  the  correct 
observation  likelihood.  Furthermore,  the  A*  decoding  algorithm  allows  us 
to  use  any  arbitrary  language  model. 
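
The difference between the MAX and the SUM can be seen in a small Python sketch. One function scores an HMM either way, depending on whether `max` or `sum` is passed in. The two word models and all probabilities are made up: one word has two equally likely pronunciation paths and the other has a single path.

```python
def path_score(observations, start, trans, emit, combine):
    """Score an HMM against a sequence of observations.

    combine=max gives the Viterbi score (single best path);
    combine=sum gives the forward score (total over all paths).
    trans -- dict: state -> list of (next_state, transition probability)
    emit  -- dict: (state, observation) -> observation likelihood
    """
    column = {start: 1.0}
    for obs in observations:
        incoming = {}
        for s, p in column.items():
            for s2, a in trans.get(s, []):
                b = emit.get((s2, obs), 0.0)
                if b > 0.0:
                    incoming.setdefault(s2, []).append(p * a * b)
        column = {s2: combine(scores) for s2, scores in incoming.items()}
    return combine(column.values()) if column else 0.0


obs = ["x", "y"]

# Word with two pronunciation paths; the start arc probabilities sum to 1.0.
multi_trans = {"start": [("m1a", 0.5), ("m2a", 0.5)],
               "m1a": [("m1b", 1.0)], "m2a": [("m2b", 1.0)]}
multi_emit = {("m1a", "x"): 0.8, ("m1b", "y"): 0.8,
              ("m2a", "x"): 0.8, ("m2b", "y"): 0.8}

# Word with a single pronunciation path and a slightly worse acoustic match.
single_trans = {"start": [("sa", 1.0)], "sa": [("sb", 1.0)]}
single_emit = {("sa", "x"): 0.7, ("sb", "y"): 0.7}

viterbi_multi = path_score(obs, "start", multi_trans, multi_emit, max)    # about 0.32
forward_multi = path_score(obs, "start", multi_trans, multi_emit, sum)    # about 0.64
viterbi_single = path_score(obs, "start", single_trans, single_emit, max) # about 0.49
forward_single = path_score(obs, "start", single_trans, single_emit, sum) # about 0.49
```

Under the Viterbi score the single-pronunciation word wins (0.49 against 0.32), while under the forward score the multiple-pronunciation word wins (0.64 against 0.49). This is the first limitation of Viterbi decoding discussed earlier.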

The A* decoding algorithm is a kind of best-first search of the lattice or
tree which implicitly defines the sequence of allowable words in a language.
Consider the tree in Figure 7.13, rooted in the START node on the left. Each
leaf of this tree defines one sentence of the language; the one formed by
concatenating all the words along the path from START to the leaf. We
don’t represent this tree explicitly, but the stack decoding algorithm uses the
tree implicitly as a way to structure the decoding search.

The algorithm performs a search from the root of the tree toward the
leaves, looking for the highest probability path, and hence the highest
probability sentence. As we proceed from root toward the leaves, each branch
leaving a given word node represents a word which may follow the current
word. Each of these branches has a probability, which expresses the
conditional probability of this next word given the part of the sentence we’ve seen
so far. In addition, we will use the forward algorithm to assign each word a
likelihood of producing some part of the observed acoustic data. The A*
decoder must thus find the path (word sequence) from the root to a leaf which
has the highest probability, where a path probability is defined as the
product of its language model probability (prior) and its acoustic match to the
data (likelihood). It does this by keeping a priority queue of partial paths
(i.e. prefixes of sentences, each annotated with a score). In a priority queue
each element has a score, and the pop operation returns the element with
the highest score. The A* decoding algorithm iteratively chooses the best
prefix-so-far, computes all the possible next words for that prefix, and adds
these extended sentences to the queue. Figure 7.14 shows the complete
algorithm.

Let’s consider a stylized example of an A* decoder working on a
waveform for which the correct transcription is If music be the food of love.
Figure 7.15 shows the search space after the decoder has examined paths of
length one from the root. A fast match is used to select the likely next
words. A fast match is one of a class of heuristics designed to efficiently
winnow down the number of possible following words, often by
computing some approximation to the forward probability (see below for further
discussion of fast matching).

At  this  point  in  our  example,  we’ve  done  the  fast  match,  selected  a sub- 
set of  the  possible  next  words,  and  assigned  each  of  them  a score.  The  word 
Alice  has  the  highest  score.  We  haven’t  yet  said  exactly  how  the  scoring 
works,  although  it  will  involve  as  a component  the  probability  of  the  hypoth- 
function STACK-DECODING() returns min-distance

  Initialize the priority queue with a null sentence.
  Pop the best (highest score) sentence s off the queue.
  If (s is marked end-of-sentence (EOS)) output s and terminate.
  Get list of candidate next words by doing fast matches.
  For each candidate next word w:
    Create a new candidate sentence s + w.
    Use forward algorithm to compute acoustic likelihood L of s + w
    Compute language model probability P of extended sentence s + w
    Compute ‘score’ for s + w (a function of L, P, and ???)
    if (end-of-sentence) set EOS flag for s + w.
    Insert s + w into the queue together with its score and EOS flag


Figure 7.14 The A* decoding algorithm (modified from Paul (1991) and
Jelinek (1997)). The evaluation function that is used to compute the score for
a sentence is not completely defined here; possible evaluation functions are
discussed below.


esized sentence given the acoustic input $P(W \mid A)$, which itself is composed
of the language model probability $P(W)$ and the acoustic likelihood $P(A \mid W)$.

Figure 7.16 shows the next stage in the search. We have expanded the
Alice  node.  This  means  that  the  Alice  node  is  no  longer  on  the  queue,  but  its 
children  are.  Note  that  now  the  node  labeled  if  actually  has  a higher  score 
than  any  of  the  children  of  Alice. 

Figure  7.17  shows  the  state  of  the  search  after  expanding  the  if  node, 
removing  it,  and  adding  if  music,  if  muscle,  and  if  messy  on  to  the  queue. 

We’ve implied that the scoring criterion for a hypothesis is related to its
probability. Indeed it might seem that the score for a string of words $w_1^i$ given
an acoustic string $y_1^j$ should be the product of the prior and the likelihood:

$$P(y_1^j \mid w_1^i)\, P(w_1^i)$$

Alas, the score cannot be this probability because the probability will
be much smaller for a longer path than a shorter one. This is due to a simple
fact about probabilities and substrings; any prefix of a string must have a
higher probability than the string itself (e.g. P(START the ...) will be greater
than P(START the book)). Thus if we used probability as the score, the A*
decoding algorithm would get stuck on the single-word hypotheses.
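
The fact about prefixes follows from the chain rule. Writing a word string as $w_1 \ldots w_n$ (notation introduced here only for this derivation), extending a prefix by one word multiplies its probability by a conditional probability that is at most 1:

```latex
\begin{align*}
P(w_1 \ldots w_n) &= P(w_1 \ldots w_{n-1}) \, P(w_n \mid w_1 \ldots w_{n-1}) \\
                  &\le P(w_1 \ldots w_{n-1})
                  \qquad \text{since } P(w_n \mid w_1 \ldots w_{n-1}) \le 1
\end{align*}
```

For instance, with the made-up values P(START the) = 0.06 and P(book | START the) = 0.002, the longer string has probability 0.06 × 0.002 = 0.00012, far below that of its prefix. The same argument applies to the acoustic likelihood, since a longer hypothesis must account for more of the input.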

Figure 7.15 The beginning of the search for the sentence If music be the
food of love. At this early stage Alice is the most likely hypothesis (it has a
higher score than the other hypotheses).

Figure 7.16 The next step of the search for the sentence If music be the
food of love. We’ve now expanded the Alice node, and added three extensions
which have a relatively high score (was, wants, and walls). Note that now the
node with the highest score is START if, which is not along the START Alice
path at all!

Instead, we use what is called the A* evaluation function (Nilsson,
1980; Pearl, 1984), $f^*(p)$, given a partial path p:

$$f^*(p) = g(p) + h^*(p)$$

$f^*(p)$ is the estimated score of the best complete path (complete
sentence) which starts with the partial path p. In other words, it is an estimate of
how well this path would do if we let it continue through the sentence. The
A* algorithm builds this estimate from two components:

• g(p) is the score from the beginning of utterance to the end of the
partial path p. This g function can be nicely estimated by the probability
of p given the acoustics so far (i.e. as $P(A \mid W)P(W)$ for the word string
W constituting p).

• h*(p)  is  an  estimate  of  the  best  scoring  extension  of  the  partial  path  to 
the  end  of  the  utterance. 

Coming  up  with  a good  estimate  of  h*  is  an  unsolved  and  interesting 
problem.  One  approach  is  to  choose  as  h*  an  estimate  which  correlates  with 
the  number  of  words  remaining  in  the  sentence  (Paul,  1991);  see  Jelinek 
(1997)  for  further  discussion. 
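
The pieces above can be put together in a short Python sketch of a stack decoder, using the standard `heapq` module as the priority queue. Everything concrete here is an assumption made for illustration: the three callback functions stand in for the fast match, the combined language model and forward-algorithm scores, and the h* estimate; the tiny search tree and its log scores are invented; and `</s>` is a hypothetical end-of-sentence marker. Scores are log probabilities, so g and h* are added.

```python
import heapq


def stack_decode(next_words, extension_logprob, h_estimate, max_pops=1000):
    """Best-first search over sentence prefixes, ranked by f* = g + h*.

    next_words(prefix)           -> candidate next words (stands in for fast match)
    extension_logprob(prefix, w) -> log score added to g by appending w
    h_estimate(prefix)           -> estimated log score of the best completion
    Returns (words, g) for the first complete sentence popped.
    """
    # heapq pops the smallest item, so the queue stores the negated f*.
    queue = [(-h_estimate(()), 0.0, ())]
    for _ in range(max_pops):
        if not queue:
            break
        _, g, prefix = heapq.heappop(queue)
        if prefix and prefix[-1] == "</s>":
            return list(prefix[:-1]), g
        for w in next_words(prefix):
            new_prefix = prefix + (w,)
            new_g = g + extension_logprob(prefix, w)
            f = new_g + h_estimate(new_prefix)
            heapq.heappush(queue, (-f, new_g, new_prefix))
    return None, float("-inf")


# Hypothetical search tree: prefix -> {next word: log score of that extension}.
tree = {
    (): {"Alice": -1.0, "if": -1.5, "every": -2.5},
    ("Alice",): {"was": -4.0, "wants": -4.5},
    ("if",): {"music": -1.0, "muscle": -5.0},
    ("if", "music"): {"</s>": -0.5},
    ("Alice", "was"): {"</s>": -0.5},
}


def next_words(prefix):
    return list(tree.get(prefix, {}))


def extension_logprob(prefix, w):
    return tree[prefix][w]


def h_estimate(prefix):
    # Made-up estimate: -0.5 per token still expected, assuming 3 tokens in all.
    return -0.5 * max(0, 3 - len(prefix))


words, score = stack_decode(next_words, extension_logprob, h_estimate)
```

As in the stylized example, Alice is popped first, but its extensions score worse than if, so the search switches to the if branch and returns if music. The first complete sentence popped is guaranteed to be the best one only when the h* estimate is optimistic, that is, never lower than the true score of the best completion.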

We mentioned above that both the A* and various other two-stage
decoding algorithms require the use of a fast match for quickly finding which
words in the lexicon are likely candidates for matching some portion of the
acoustic input. Many fast match algorithms are based on the use of a
tree-structured lexicon, which stores the pronunciations of all the words in such
a way that the computation of the forward probability can be shared for
words which start with the same sequence of phones. The tree-structured
lexicon was first suggested by Klovstad and Mondshein (1975); fast match
algorithms which make use of it include Gupta et al. (1988), Bahl et al.
(1992) in the context of A* decoding, and Ney et al. (1992) and Nguyen and
Schwartz (1999) in the context of Viterbi decoding. Figure 7.18 shows an
example of a tree-structured lexicon from the Sphinx-II recognizer
(Ravishankar, 1996). Each tree root represents the first phone of all words
beginning with that context dependent phone (phone context may or may not be
preserved across word boundaries), and each leaf is associated with a word.
Figure 7.18 A tree-structured lexicon from the Sphinx-II recognizer
(after Ravishankar (1996)). Each node corresponds to a particular triphone in a
slightly modified version of the ARPAbet; thus EY(B,KD) means the phone
EY preceded by a B and followed by the closure of a K.
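
The sharing that a tree-structured lexicon provides can be illustrated with a plain prefix tree over phone strings. This sketch is a simplification and not the Sphinx-II data structure: it uses context-independent phones rather than triphones, records a word at the node where its pronunciation ends, and uses simplified, illustrative pronunciations.

```python
def build_lexicon_tree(pronunciations):
    """Build a prefix tree over phone sequences.

    pronunciations -- dict: word -> list of phones
    Each node is a dict with 'children' (phone -> node) and 'words'
    (the words whose pronunciation ends at that node).
    """
    root = {"children": {}, "words": []}
    for word, phones in pronunciations.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def count_nodes(node):
    """Count this node and every node below it."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())


# Simplified, illustrative pronunciations.
pronunciations = {
    "bake": ["b", "ey", "k"],
    "baked": ["b", "ey", "k", "t"],
    "bakery": ["b", "ey", "k", "er", "iy"],
    "baby": ["b", "ey", "b", "iy"],
}
tree = build_lexicon_tree(pronunciations)
phone_nodes = count_nodes(tree) - 1                              # exclude the root
total_phones = sum(len(p) for p in pronunciations.values())
bake_node = tree["children"]["b"]["children"]["ey"]["children"]["k"]
```

The four words contain 16 phones in total but need only 8 phone nodes, so any score computed for the shared prefix b ey is computed once and reused by every word below it.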


There are many other kinds of multiple-stage search, such as the
forward-backward search algorithm (not to be confused with the forward-backward
algorithm) (Austin et al., 1991) which performs a simple forward search
followed by a detailed backward (i.e. time-reversed) search.


7.5 Acoustic Processing of Speech

This section presents a very brief overview of the kind of acoustic processing
commonly called feature extraction or signal analysis in the speech
recognition literature. The term features refers to the vector of numbers which
represent one time-slice of a speech signal. A number of kinds of features
are commonly used, such as LPC features and PLP features. All of these are
spectral features, which means that they represent the waveform in terms of
the distribution of different frequencies which make up the waveform; such
a distribution of frequencies is called a spectrum. We will begin with a brief
introduction to the acoustic waveform and how it is digitized, summarize the
idea of frequency analysis and spectra, and then sketch out different kinds of
extracted features. This will be an extremely brief overview; the interested
reader should refer to other books on the linguistic aspects of acoustic
phonetics (Johnson, 1997; Ladefoged, 1996) or on the engineering aspects of
digital signal processing of speech (Rabiner and Juang, 1993).


Sound  Waves 

The input to a speech recognizer, like the input to the human ear, is a complex
series of changes in air pressure. These changes in air pressure obviously
originate with the speaker, and are caused by the specific way that air passes
through the glottis and out the oral or nasal cavities. We represent sound
waves by plotting the change in air pressure over time. One metaphor which
sometimes helps in understanding these graphs is to imagine a vertical plate
which is blocking the air pressure waves (perhaps in a microphone in front of
a speaker’s mouth, or the eardrum in a hearer’s ear). The graph measures the
amount of compression or rarefaction (uncompression) of the air molecules
at this plate. Figure 7.19 shows the waveform taken from the Switchboard
corpus of telephone speech of someone saying “she just had a baby”.

Two important characteristics of a wave are its frequency and
amplitude. The frequency is the number of times a second that a wave repeats
itself, or cycles. Note in Figure 7.19 that there are 28 repetitions of the wave
in the .11 seconds we have captured. Thus the frequency of this segment of
the wave is 28/.11 or 255 cycles per second. Cycles per second are usually
called Hertz (shortened to Hz), so the frequency in Figure 7.19 would be
described as 255 Hz.
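
The same counting procedure can be sketched in Python on a synthetic signal. The sampling rate of 8000 samples per second, the pure sine wave, and the function name are assumptions for illustration; real speech is far less regular, and counting zero crossings is only a crude estimate of frequency.

```python
import math


def estimate_frequency(samples, sample_rate):
    """Estimate frequency in Hz by counting upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration = len(samples) / sample_rate
    return crossings / duration


sample_rate = 8000          # assumed samples per second
num_samples = 880           # 0.11 seconds at 8000 samples per second
samples = [math.sin(2 * math.pi * 255 * n / sample_rate)
           for n in range(num_samples)]

estimate = estimate_frequency(samples, sample_rate)   # 28 cycles / 0.11 s
by_hand = 28 / 0.11                                   # about 254.5, i.e. 255 Hz
```

Counting 28 repetitions in 0.11 seconds gives about 254.5 cycles per second, which rounds to the 255 Hz quoted above.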

The vertical axis in Figure 7.19 measures the amount of air pressure
variation. A high value on the vertical axis (a high amplitude) indicates
that there is more air pressure at that point in time, a zero value means there
is normal (atmospheric) air pressure, while a negative value means there is
lower than normal air pressure (rarefaction).

Figure 7.19 A waveform of the vowel [iy] from the utterance shown in
Figure 7.20. The y-axis shows the changes in air pressure above and below
normal atmospheric pressure. The x-axis shows time. Notice that the wave
repeats regularly.

Two important perceptual properties are related to frequency and
amplitude. The pitch of a sound is the perceptual correlate of frequency; in
general if a sound has a higher frequency we perceive it as having a higher
pitch, although the relationship is not linear, since human hearing has
different acuities for different frequencies. Similarly, the loudness of a
sound is the perceptual correlate of the power, which is related to the square
of the amplitude. So sounds with higher amplitudes are perceived as louder, but
again the relationship is not linear.

How  to  Interpret  a Waveform 

Since humans (and to some extent machines) can transcribe and understand
speech just given the sound wave, the waveform must contain enough
information to make the task possible. In most cases this information is hard
to unlock just by looking at the waveform, but such visual inspection is still
sufficient to learn some things. For example, the difference between vowels
and most consonants is relatively clear on a waveform. Recall that vowels
are voiced, tend to be long, and are relatively loud. Length in time manifests
itself directly as length in space on a waveform plot. Loudness manifests
itself as high amplitude. How do we recognize voicing? Recall that voicing
is caused by regular openings and closings of the vocal folds. When the vocal
folds are vibrating, we can see regular peaks in amplitude of the kind we saw
in Figure 7.19. During a stop consonant, for example the closure of a [p], [t],
or [k], we should expect no peaks at all; in fact we expect silence.

Notice in Figure 7.20 the places where there are regular amplitude
peaks indicating voicing: from second .46 to .58 (the vowel [iy]), from
second .65 to .74 (the vowel [ax]) and so on. The places where there is no
amplitude indicate the silence of a stop closure; for example from second
1.06 to second 1.08 (the closure for the first [b]), or from second 1.26 to 1.28
(the closure for the second [b]).




Figure  7.20  A waveform  of  the  sentence  “She  just  had  a baby”  from  the 
Switchboard  corpus  (conversation  4325).  The  speaker  is  female,  was  20  years 
old  in  1991  which  is  approximately  when  the  recording  was  made,  and  speaks 
the  South  Midlands  dialect  of  American  English.  The  phone  labels  show 
where  each  phone  ends. 


Fricatives  like  [sh]  can  also  be  recognized  in  a waveform;  they  produce 
an  intense  irregular  pattern;  the  [sh]  from  second  .33  to  .46  is  a good  example 
of  a fricative. 

Spectra 

While some broad phonetic features (presence of voicing, stop closures,
fricatives) can be interpreted from a waveform, more detailed classification
(which vowel? which fricative?) requires a different representation of the
input in terms of spectral features. Spectral features are based on the
insight of Fourier that every complex wave can be represented as a sum of
many simple waves of different frequencies. A musical analogy for this is
the chord; just as a chord is composed of multiple notes, any waveform is
composed of the waves corresponding to its individual "notes".

Consider Figure 7.21, which shows part of the waveform for the vowel
[ae] of the word had at second 0.9 of the sentence. Note that there is a
complex wave which repeats about nine times in the figure; but there is also a
smaller repeated wave which repeats four times for every larger pattern
(notice the four small peaks inside each repeated wave). The complex wave has
a frequency of about 250 Hz (we can figure this out since it repeats roughly
9 times in .036 seconds, and 9 cycles/.036 seconds = 250 Hz). The smaller
wave then should have a frequency of roughly 4 times the frequency of the
larger wave, or roughly 1000 Hz. Then if you look carefully you can see
two little waves on the peak of many of the 1000 Hz waves. The frequency
of this tiniest wave must be roughly twice that of the 1000 Hz wave, hence
2000 Hz.
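
The reasoning above can be checked numerically. The following is a minimal
sketch, not the analysis used to produce the book's figures: it builds a
synthetic wave from three sine components at 250, 1000, and 2000 Hz and
recovers those frequencies with a naive discrete Fourier transform. The
amplitudes, the 8,000 Hz sampling rate, and the function names are all
assumptions chosen for illustration.

```python
import cmath
import math

def synth_wave(components, sample_rate, duration):
    """Sum of sine waves; components is a list of (frequency_hz, amplitude)."""
    n = int(round(sample_rate * duration))
    return [sum(a * math.sin(2 * math.pi * f * t / sample_rate)
                for f, a in components)
            for t in range(n)]

def dft_magnitudes(samples):
    """Naive O(N^2) discrete Fourier transform, bins 0..N/2 only."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(samples))
        mags.append(abs(acc))
    return mags

def strongest_frequencies(samples, sample_rate, count):
    """Frequencies (Hz) of the `count` largest-magnitude DFT bins, ascending."""
    mags = dft_magnitudes(samples)
    bin_hz = sample_rate / len(samples)
    top = sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:count]
    return sorted(k * bin_hz for k in top)

# Hypothetical amplitudes: the lowest component is the strongest.
wave = synth_wave([(250, 1.0), (1000, 0.5), (2000, 0.25)],
                  sample_rate=8000, duration=0.04)
print(strongest_frequencies(wave, 8000, 3))  # [250.0, 1000.0, 2000.0]
```

Real speech is not exactly periodic within the analysis window, so its
spectrum shows broad peaks rather than three isolated bins; the sketch only
illustrates the idea that a complex wave is a sum of simple ones.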

A spectrum is a representation of these different frequency
components of a wave. It can be computed by a Fourier transform, a mathematical
procedure which separates out each of the frequency components of a wave.
Rather than using the Fourier transform spectrum directly, most speech
applications use a smoothed version of the spectrum called the LPC spectrum
(Atal and Hanauer, 1971; Itakura, 1975).

Figure 7.22 shows an LPC spectrum for the waveform in Figure 7.21.
LPC (Linear Predictive Coding) is a way of coding the spectrum which
makes it easier to see where the spectral peaks are.

The x-axis of a spectrum shows frequency while the y-axis shows some
measure of the magnitude of each frequency component (in decibels (dB),
a logarithmic measure of amplitude). Thus Figure 7.22 shows that there are
important frequency components at 930 Hz, 1860 Hz, and 3020 Hz, along
with many other lower-magnitude frequency components. These important
components at roughly 1000 Hz and 2000 Hz are just what we predicted by
looking at the wave in Figure 7.21!

Why is a spectrum useful? It turns out that these spectral peaks that
are easily visible in a spectrum are very characteristic of different sounds;
phones have characteristic spectral 'signatures'. For example, different
chemical elements give off different wavelengths of light when they burn,
allowing us to detect elements in stars light-years away by looking at the
spectrum of the light. Similarly, by looking at the spectrum of a waveform,
we can detect the characteristic signature of the different phones that are
present. This use of spectral information is essential to both human and
machine speech recognition. In human audition, the function of the cochlea or
inner ear is to compute a spectrum of the incoming waveform. Similarly, the
features used as input to the HMMs in speech recognition are all
representations of spectra, usually variants of LPC spectra, as we will see.

While a spectrum shows the frequency components of a wave at one
point in time, a spectrogram is a way of envisioning how the different
frequencies which make up a waveform change over time. The x-axis shows
time, as it did for the waveform, but the y-axis now shows frequencies in Hz.
The darkness of a point on a spectrogram corresponds to the amplitude of
the frequency component. For example, look in Figure 7.23 around second
0.9, and notice the dark bar at around 1000 Hz. This means that the [ae]
of the word had has an important component around 1000 Hz (1000 Hz is
just between the notes B and C). The dark horizontal bars on a spectrogram,
representing spectral peaks, usually of vowels, are called formants.

What specific clues can spectral representations give for phone
identification? First, different vowels have their formants at characteristic
places. We've seen that [ae] in the sample waveform had formants at 930 Hz,
1860 Hz, and 3020 Hz. Consider the vowel [iy], at the beginning of the
utterance in Figure 7.20. The spectrum for this vowel is shown in Figure 7.24.
The first formant of [iy] is 540 Hz, much lower than the first formant for
[ae], while the second formant (2581 Hz) is much higher than the second formant
for [ae]. If you look carefully you can see these formants as dark bars in
Figure 7.23 just around 0.5 seconds.

The location of the first two formants (called F1 and F2) plays a large
role in determining vowel identity, although the formants still differ from
speaker to speaker. Formants also can be used to identify the nasal phones
[n], [m], and [ng], the lateral phone [l], and [r]. Why do different vowels
have different spectral signatures? The formants are caused by the resonant
cavities of the mouth. The oral cavity can be thought of as a filter which
selectively passes through some of the harmonics of the vocal cord vibrations.
Moving the tongue creates spaces of different size inside the mouth which
selectively amplify waves of the appropriate wavelength, hence amplifying
different frequency bands.

Figure 7.23 A spectrogram of the sentence "She just had a baby" whose
waveform was shown in Figure 7.20. One way to think of a spectrogram is as
a collection of spectra (time-slices) like Figure 7.22 placed end to end.

Figure 7.24 A smoothed (LPC) spectrum for the vowel [iy] at the start of
She just had a baby. Note that the first formant (540 Hz) is much lower than
the first formant for [ae] shown in Figure 7.22, while the second formant (2581
Hz) is much higher than the second formant for [ae].

Our survey of the features of waveforms and spectra was necessarily brief,
but the reader should have the basic idea of the importance of spectral
features and their relation to the original waveform. Let's now summarize the
process of extraction of spectral features, beginning with the sound wave
itself and ending with a feature vector.4 An input soundwave is first
digitized. This process of analog-to-digital conversion has two steps:
sampling and quantization. A signal is sampled by measuring its amplitude
at a particular time; the sampling rate is the number of samples taken per
second. Common sampling rates are 8,000 Hz and 16,000 Hz. In order to
accurately measure a wave, it is necessary to have at least two samples in
each cycle: one measuring the positive part of the wave and one measuring
the negative part. More than two samples per cycle increases the amplitude
accuracy, but fewer than two samples will cause the frequency of the wave to
be completely missed. Thus the maximum frequency wave that can be
measured is one whose frequency is half the sampling rate (since every cycle
needs 2 samples). This maximum frequency for a given sampling rate is called
the Nyquist frequency. Most information in human speech is in frequencies
below 10,000 Hz; thus a 20,000 Hz sampling rate would be necessary for
complete accuracy. But telephone speech is filtered by the switching network,
and only frequencies less than 4,000 Hz are transmitted by telephones. Thus
an 8,000 Hz sampling rate is sufficient for telephone-bandwidth speech like
the Switchboard corpus.
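
The relation between sampling rate and the highest measurable frequency can
be sketched in a few lines. The first function restates the definition from
the text. The second goes slightly beyond the text: it uses the standard
signal-processing result that a pure tone above the Nyquist frequency is not
simply lost but shows up as a lower "alias" frequency. The function names
are assumptions made for this illustration.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be measured at the given sampling rate."""
    return sample_rate_hz / 2

def apparent_frequency(true_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling.

    Tones at or below the Nyquist frequency are unchanged; higher tones
    fold back into the range 0..Nyquist (aliasing).
    """
    folded = true_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_frequency(8000))          # 4000.0
print(apparent_frequency(3000, 8000))   # 3000 (below Nyquist, preserved)
print(apparent_frequency(5000, 8000))   # 3000 (above Nyquist, aliased)
```

This is one reason a signal is normally low-pass filtered before sampling,
as the telephone network effectively does for Switchboard speech.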

Even an 8,000 Hz sampling rate requires 8000 amplitude
measurements for each second of speech, and so it is important to store the
amplitude measurement efficiently. They are usually stored as integers, either
8-bit (values from -128 to 127) or 16-bit (values from -32768 to 32767). This
process of representing a real-valued number as an integer is called
quantization because there is a minimum granularity (the quantum size) and all
values which are closer together than this quantum size are represented
identically.
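
A minimal sketch of quantization follows. It assumes, purely for
illustration, that amplitudes have already been scaled to the range -1.0 to
1.0; the function name and the scaling convention are not from the text.

```python
def quantize(x, bits):
    """Map a real amplitude in [-1.0, 1.0] to a signed integer of `bits` bits."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = int(round(x * hi))          # scale, then snap to the nearest integer
    return max(lo, min(hi, q))      # clip anything outside the legal range

# Two amplitudes closer together than the 8-bit quantum size collapse to
# the same value, but remain distinct at 16 bits.
print(quantize(0.300, 8), quantize(0.302, 8))     # 38 38
print(quantize(0.300, 16), quantize(0.302, 16))   # 9830 9896
```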

Once a waveform has been digitized, it is converted to some set of
spectral features. An LPC spectrum is represented by a vector of features;
each formant is represented by two features, plus two additional features to
represent spectral tilt. Thus 5 formants can be represented by 12 (5x2+2)
features. It is possible to use LPC features directly as the observation
symbols of an HMM. However, further processing is often done to the features.
One popular feature set is cepstral coefficients, which are computed from
the LPC coefficients by taking the Fourier transform of the spectrum.
Another feature set, PLP (Perceptual Linear Predictive analysis (Hermansky,
1990)), takes the LPC features and modifies them in ways consistent with
human hearing. For example, the spectral resolution of human hearing is
worse at high frequencies, and the perceived loudness of a sound is related
to the cube root of its intensity. So PLP applies various filters to the LPC
spectrum and takes the cube root of the features.

4 The reader might want to bear in mind Picone's (1993) reminder that the use of the word
extraction should not be thought of as encouraging the metaphor of features as something
'in the signal' waiting to be extracted.


7.6  Computing  Acoustic  Probabilities 


The  last  section  showed  how  the  speech  input  can  be  passed  through  signal 
processing  transformations  and  turned  into  a series  of  vectors  of  features, 
each vector representing one time-slice of the input signal. How are these
feature  vectors  turned  into  probabilities? 

One  way  to  compute  probabilities  on  feature  vectors  is  to  first  cluster 
them  into  discrete  symbols  that  we  can  count;  we  can  then  compute  the 
probability  of  a given  cluster  just  by  counting  the  number  of  times  it  occurs  in 
some  training  set.  This  method  is  usually  called  vector  quantization.  Vector 
quantization  was  quite  common  in  early  speech  recognition  algorithms  but 
has  mainly  been  replaced  by  a more  direct  but  compute-intensive  approach: 
computing  observation  probabilities  on  a real-valued  (‘continuous’)  input 
vector.  This  method  thus  computes  a probability  density  function  or  pdf 
over  a continuous  space. 
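
The vector quantization idea described above can be sketched as follows:
each feature vector is replaced by the index of its nearest codebook entry,
and symbol probabilities are then just relative counts. The tiny codebook
and training vectors are invented for illustration, and squared Euclidean
distance is an assumed choice of distance measure; in practice the codebook
would be learned by clustering a training set.

```python
from collections import Counter

def nearest_codeword(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def dist(c):
        return sum((v - ci) ** 2 for v, ci in zip(vector, c))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

def symbol_probabilities(vectors, codebook):
    """Relative frequency of each codebook symbol over a training set."""
    counts = Counter(nearest_codeword(v, codebook) for v in vectors)
    return {k: counts[k] / len(vectors) for k in range(len(codebook))}

# Hypothetical 2-dimensional codebook and training vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
training = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (3.7, 0.3)]
print(symbol_probabilities(training, codebook))  # {0: 0.25, 1: 0.5, 2: 0.25}
```

In a recognizer these counts would be kept separately for each HMM state, so
that every state has its own distribution over the discrete symbols.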

There are two popular versions of the continuous approach. The most
widespread of the two is the use of Gaussian pdfs, in the simplest
version of which each state has a single Gaussian function which maps the
observation vector $o_t$ to a probability. An alternative approach is the use
of neural networks or multi-layer perceptrons which can also be trained
to assign a probability to a real-valued feature vector. HMMs with
Gaussian observation-probability-estimators are trained by a simple extension
to the forward-backward algorithm (discussed in Appendix D). HMMs with
neural-net observation-probability-estimators are trained by a completely
different algorithm known as error back-propagation.

In the simplest use of Gaussians, we assume that the possible values
of the observation feature vector $o_t$ are normally distributed, and so we
represent the observation probability function $b_j(o_t)$ as a Gaussian curve
with mean vector $\mu_j$ and covariance matrix $\Sigma_j$ (prime denotes vector
transpose, and $n$ is the number of features in $o_t$). We present the equation
here for completeness, although we will not cover the details of the
mathematics:

$$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma_j|}} \exp\left[-\frac{1}{2}(o_t-\mu_j)'\,\Sigma_j^{-1}\,(o_t-\mu_j)\right] \qquad (7.10)$$

Usually we make the simplifying assumption that the covariance
matrix $\Sigma_j$ is diagonal, i.e. that it contains the simple variance of
cepstral feature 1, the simple variance of cepstral feature 2, and so on,
without worrying about the effect of cepstral feature 1 on the variance of
cepstral feature 2. This means that in practice we are keeping only a single
separate mean and variance for each feature in the feature vector.
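
With a diagonal covariance matrix the Gaussian density factors into a
product of one-dimensional Gaussians, one per feature. The sketch below
computes its logarithm, which turns the product into a sum. Working in the
log domain is a common implementation choice to avoid numerical underflow
and is an assumption of this example, not something the text prescribes; the
function name is likewise hypothetical.

```python
import math

def diag_gaussian_log_density(o, mean, var):
    """Log density of observation `o` under a diagonal-covariance Gaussian.

    `mean` and `var` hold one mean and one variance per feature.
    """
    total = 0.0
    for x, m, v in zip(o, mean, var):
        # Log of a one-dimensional Gaussian for this feature.
        total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

# At the mean of a 2-feature unit-variance Gaussian the log density is
# -log(2*pi), roughly -1.84.
print(diag_gaussian_log_density([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
```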

Most recognizers do something even more complicated; they keep
multiple Gaussians for each state, so that the probability of each feature of
the observation vector is computed by adding together a variety of Gaussian
curves. This technique is called Gaussian mixtures. In addition, many ASR
systems share Gaussians between states in a technique known as parameter
tying (or tied mixtures) (Huang and Jack, 1989). For example, acoustically
similar phone states might share (i.e. use the same) Gaussians for some
features.
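
A Gaussian mixture can be sketched as a weighted sum of component densities.
The mixture weights (assumed here to be non-negative and to sum to 1) are a
standard part of such models but are not spelled out in the text, so treat
them, along with the function names, as assumptions of this example.

```python
import math

def diag_gaussian_density(o, mean, var):
    """Density of `o` under a diagonal-covariance Gaussian."""
    p = 1.0
    for x, m, v in zip(o, mean, var):
        p *= math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
    return p

def mixture_density(o, components):
    """Weighted sum of Gaussians; components is a list of (weight, mean, var)."""
    return sum(w * diag_gaussian_density(o, m, v) for w, m, v in components)

# Hypothetical two-component mixture over a single feature.
components = [(0.5, [-1.0], [1.0]), (0.5, [1.0], [1.0])]
print(mixture_density([0.0], components))
```

Tying then amounts to letting several states refer to the same
`(mean, var)` pairs while keeping their own weights.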

How are the mean and covariance of the Gaussians estimated? It is
helpful again to consider the simpler case of a non-hidden Markov Model,
with only one state $i$. The vector of feature means $\mu$ and the vector of
covariances $\Sigma$ could then be estimated by averaging:

$$\hat{\mu}_i = \frac{1}{T}\sum_{t=1}^{T} o_t \qquad (7.11)$$

$$\hat{\Sigma}_i = \frac{1}{T}\sum_{t=1}^{T} (o_t-\hat{\mu}_i)(o_t-\hat{\mu}_i)' \qquad (7.12)$$

But since there are multiple hidden states, we don't know which
observation vector $o_t$ was produced by which state. Appendix D will show how
the forward-backward algorithm can be modified to assign each observation
vector $o_t$ to every possible state $i$, prorated by the probability that the
HMM was in state $i$ at time $t$.
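
The difference between plain averaging and prorated averaging can be seen in
a small sketch. Each observation carries a weight standing in for the
probability that the HMM was in state i at time t; with all weights equal to
1 the function reduces to ordinary averaging. The weights in the example are
made up; in a real system they would come from the forward-backward
algorithm, which is not implemented here.

```python
def weighted_mean_and_variance(observations, weights):
    """Per-feature mean and variance with each observation weighted.

    weights[t] plays the role of the probability of being in the state
    of interest at time t.
    """
    total = sum(weights)
    dims = len(observations[0])
    mean = [sum(w * o[d] for o, w in zip(observations, weights)) / total
            for d in range(dims)]
    var = [sum(w * (o[d] - mean[d]) ** 2
               for o, w in zip(observations, weights)) / total
           for d in range(dims)]
    return mean, var

obs = [[1.0], [3.0]]
print(weighted_mean_and_variance(obs, [1.0, 1.0]))  # ([2.0], [1.0])
print(weighted_mean_and_variance(obs, [3.0, 1.0]))  # ([1.5], [0.75])
```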

An alternative way to model continuous-valued features is the use of a
neural network, multilayer perceptron (MLP) or Artificial Neural
Networks (ANNs). Neural networks are far too complex for us to introduce in
a page or two here; thus we will just give the intuition of how they are used
in probability estimation as an alternative to Gaussian estimators. The
interested reader should consult basic neural network textbooks (Anderson,
1995; Hertz et al., 1991) as well as references specifically focusing on
neural-network speech recognition (Bourlard and Morgan, 1994).

A neural  network  is  a set  of  small  computation  units  connected  by 
weighted  links.  The  network  is  given  a vector  of  input  values  and  computes 
a vector  of  output  values.  The  computation  proceeds  by  each  computational 
unit  computing  some  non-linear  function  of  its  input  units  and  passing  the 
resulting  value  on  to  its  output  units. 

The use of neural networks we will describe here is often called a
hybrid HMM-MLP approach, since it uses some elements of the HMM (such
as the state-graph representation of the pronunciation of a word) but the
observation-probability computation is done by an MLP instead of a
mixture of Gaussians. The input to these MLPs is a representation of the signal
at a time $t$ and some surrounding window; for example this might mean a
vector of spectral features for a time $t$ and 8 additional vectors for times
$t+10$ ms, $t+20$ ms, $t+30$ ms, $t+40$ ms, $t-10$ ms, etc. Thus the input to
the network is a set of nine vectors, each vector having the complete set of
real-valued spectral features for one time slice. The network has one output
unit for each phone; by constraining the values of all the output units to sum
to 1, the net can be used to compute the probability of a state $j$ given an
observation vector $o_t$, or $P(j|o_t)$. Figure 7.25 shows a sample of such a net.
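
Building the nine-vector input window can be sketched as below. The text
does not say what happens near the start and end of an utterance, where some
neighbouring frames do not exist; repeating the nearest edge frame is an
assumption made here, and the function name is hypothetical.

```python
def stack_context(frames, t, left=4, right=4):
    """Concatenate frame t with its neighbours into one MLP input vector.

    Positions before the first or after the last frame are filled with
    the nearest edge frame (an assumption; other padding is possible).
    """
    stacked = []
    for k in range(t - left, t + right + 1):
        k = min(max(k, 0), len(frames) - 1)   # clamp to the valid range
        stacked.extend(frames[k])
    return stacked

# Ten hypothetical frames with 2 features each; frames are 10 ms apart.
frames = [[float(i), i + 0.5] for i in range(10)]
window = stack_context(frames, 5)
print(len(window))   # 18, i.e. 9 frames x 2 features
```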

This MLP computes the probability of the HMM state $j$ given an
observation $o_t$, or $P(q_j|o_t)$. But the observation likelihood we need for
the HMM, $b_j(o_t)$, is $p(o_t|q_j)$. The Bayes rule can help us see how to
compute one from the other. The net is computing:

$$P(q_j|o_t) = \frac{p(o_t|q_j)\,p(q_j)}{p(o_t)} \qquad (7.13)$$

We can rearrange the terms as follows:

$$\frac{p(o_t|q_j)}{p(o_t)} = \frac{P(q_j|o_t)}{p(q_j)} \qquad (7.14)$$

The two terms on the right-hand side of (7.14) can be directly
computed from the MLP; the numerator is the output of the MLP, and the
denominator is the total probability of a given state, summing over all
observations (i.e. the sum over all $t$ of $\sigma_j(t)$). Thus although we
cannot directly compute $p(o_t|q_j)$, we can use (7.14) to compute
$p(o_t|q_j)/p(o_t)$, which is known as a scaled likelihood (the likelihood
divided by the probability of the observation). In fact, the scaled likelihood
is just as good as the regular likelihood, since the probability of the
observation $p(o_t)$ is a constant during recognition and doesn't hurt us to
have in the equation.

Figure 7.25 A neural net used to estimate phone state probabilities. Such
a net can be used in an HMM model as an alternative to the Gaussian models.
This particular net is from the MLP systems described in Bourlard and Morgan
(1994); it is given a vector of features for a frame and for the four frames
on either side, and estimates $p(q_j|o_t)$. This probability is then converted
to an estimate of the observation likelihood $b_j(o_t) = p(o_t|q_j)$ using the
Bayes rule. These nets are trained using the error-back-propagation algorithm
as part of the same embedded training algorithm that is used for Gaussians.
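
The conversion from MLP outputs to scaled likelihoods is a single division
per state. In the sketch below the state priors are estimated as relative
frequencies of states in a labeled training set, which is one simple way to
obtain them and is an assumption of this example; the posterior values are
invented.

```python
def state_priors(state_labels, num_states):
    """Relative frequency of each state among labeled training frames."""
    counts = [0] * num_states
    for q in state_labels:
        counts[q] += 1
    return [c / len(state_labels) for c in counts]

def scaled_likelihoods(posteriors, priors):
    """Divide each MLP output P(q_j | o_t) by the prior p(q_j).

    Assumes every state has a non-zero prior.
    """
    return [post / prior for post, prior in zip(posteriors, priors)]

priors = state_priors([0, 0, 0, 1], num_states=2)   # [0.75, 0.25]
print(scaled_likelihoods([0.6, 0.4], priors))       # roughly [0.8, 1.6]
```

Note how the rarer state ends up with the larger scaled likelihood even
though its posterior is smaller: dividing by the prior removes the bias
toward frequent states that the posterior carries.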

The error-back-propagation algorithm for training an MLP requires
that we know the correct phone label $q_j$ for each observation $o_t$. Given a
large training set of observations and correct labels, the algorithm iteratively
adjusts the weights in the MLP to minimize the error with this training set.
In the next section we will see where this labeled training set comes from,
and how this training fits in with the embedded training algorithm used
for HMMs. Neural nets seem to achieve roughly the same performance as
a Gaussian model but have the advantage of using fewer parameters and the
disadvantage of taking somewhat longer to train.


Methodology  Box:  Word  Error  Rate 


The standard evaluation metric for speech recognition systems
is the word error rate. The word error rate is based on how much
the word string returned by the recognizer (often called the
hypothesized word string) differs from a correct or reference transcription.
Given such a correct transcription, the first step in computing word
error is to compute the minimum edit distance in words between
the hypothesized and correct strings. The result of this computation
will be the minimum number of word substitutions, word
insertions, and word deletions necessary to map between the correct and
hypothesized strings. The word error rate is then defined as follows
(note that because the equation includes insertions, the error rate can
be greater than 100%):

$$\text{Word Error Rate} = 100 \times \frac{\text{Insertions} + \text{Substitutions} + \text{Deletions}}{\text{Total Words in Correct Transcript}}$$

Here is an example of alignments between a reference and a
hypothesized utterance from the CALLHOME corpus, showing the
counts used to compute the word error rate:

This utterance has 6 substitutions, 3 insertions, and 1 deletion:

$$\text{Word Error Rate} = 100 \times \frac{6 + 3 + 1}{18} = 56\%$$
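
The computation can be sketched as a word-level minimum edit distance in
which substitutions, insertions, and deletions each count as one error. The
sentences used below are invented for illustration and are not the CALLHOME
example; the function name is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate (percent) between two whitespace-separated strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitute = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(substitute, delete, insert)
    return 100.0 * dist[len(ref)][len(hyp)] / len(ref)

# One deletion ("just") and one substitution ("a" -> "the") over 5 words.
print(word_error_rate("she just had a baby", "she had the baby"))  # 40.0
# Insertions can push the rate above 100%.
print(word_error_rate("a", "b c d"))                               # 300.0
```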

As of the time of this writing, state-of-the-art speech recognition
systems were achieving around 20% word error rate on natural-
speech tasks like the National Institute of Standards and Technology
(NIST)'s Hub4 test set from the Broadcast News corpus (Chen et al.,
1999), and around 40% word error rate on NIST's Hub5 test set from
the combined Switchboard, Switchboard-II, and CALLHOME
corpora (Hain et al., 1999).


7.7 Training a Speech Recognizer

We have now introduced all the algorithms which make up the standard
speech recognition system that was sketched in Figure 7.2 on page 239.
We've seen how to build a Viterbi decoder, and how it takes 3 inputs (the
observation likelihoods (via Gaussian or MLP estimation from the spectral
features), the HMM lexicon, and the N-gram language model) and produces
the most probable string of words. But we have not seen how all the
probabilistic models that make up a recognizer get trained.

In this section we give a brief sketch of the embedded training
procedure that is used by most ASR systems, whether based on Gaussians, MLPs,
or even vector quantization. Some of the details of the algorithm (like the
forward-backward algorithm for training HMM probabilities) have been
removed to Appendix D.

Let's begin by summarizing the four probabilistic models we need to
train in a basic speech recognition system:

• language model probabilities: $P(w_i | w_{i-1} w_{i-2})$

• observation likelihoods: $b_j(o_t)$

• transition probabilities: $a_{ij}$

• pronunciation lexicon: HMM state graph structure

In order to train these components we usually have

• a training corpus of speech wavefiles, together with a word-transcription.

• a much larger corpus of text for training the language model,
including the word-transcriptions from the speech corpus together with many
other similar texts.

• often a smaller training corpus of speech which is phonetically labeled
(i.e. frames of the acoustic signal are hand-annotated with phonemes).

Let's begin with the N-gram language model. This is trained in the
way we described in Chapter 6: by counting N-gram occurrences in a large
corpus, then smoothing and normalizing the counts. The corpus used for
training the language model is usually much larger than the corpus used to
train the HMM a and b parameters. This is because the larger the training
corpus the more accurate the models. Since N-gram models are much faster
to train than HMM observation probabilities, and since text just takes less
space than speech, it turns out to be feasible to train language models on
huge corpora of as much as half a billion words of text. Generally the corpus
used for training the HMM parameters is included as part of the language
model training data; it is important that the acoustic and language model
training be consistent.

The  HMM  lexicon  structure  is  built  by  hand,  by  taking  an  off-the-shelf 
pronunciation  dictionary  such  as  the  PRONLEX  dictionary  (LDC,  1995)  or 
the  CMUdict  dictionary,  both  described  in  Chapter  4.  In  some  systems,  each 
phone  in  the  dictionary  maps  into  a state  in  the  HMM.  So  the  word  cat  would 
have  3 states  corresponding  to  [k],  [ae],  and  [t].  Many  systems,  however,  use 
the  more  complex  subphone  structure  described  on  page  249,  in  which  each 
phone  is  divided  into  3 states:  the  beginning,  middle  and  final  portions  of 
the phone, and in which furthermore there are separate instances of each of
these  subphones  for  each  triphone  context. 
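
The two lexicon structures can be contrasted with a small sketch that
expands a pronunciation into a left-to-right list of state names. The naming
scheme (`left-phone+right.part`) and the `#` symbol for a word boundary are
conventions invented for this example, not notation from the text.

```python
def phone_states(phones):
    """One HMM state per phone."""
    return list(phones)

def triphone_subphone_states(phones):
    """Three states (beginning, middle, end) per phone, one set per
    triphone context. '#' marks a word boundary (a naming assumption)."""
    states = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i + 1 < len(phones) else "#"
        for part in ("beg", "mid", "end"):
            states.append(f"{left}-{p}+{right}.{part}")
    return states

cat = ["k", "ae", "t"]
print(phone_states(cat))                   # ['k', 'ae', 't']
print(len(triphone_subphone_states(cat)))  # 9
print(triphone_subphone_states(cat)[3])    # k-ae+t.beg
```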

The details of the embedded training of the HMM parameters vary;
we'll present a simplified version. First, we need some initial estimate of
the transition and observation probabilities $a_{ij}$ and $b_j(o_t)$. For the
transition probabilities, we start by assuming that for any state all the
possible following states are equiprobable. The observation probabilities can
be bootstrapped from a small hand-labeled training corpus. For example, the
TIMIT or Switchboard corpora contain approximately 4 hours each of
phonetically labeled speech. They supply a 'correct' phone state label $q$ for
each frame of speech. These can be fed to an MLP or averaged to give initial
Gaussian means and variances. For MLPs this initial estimate is important,
and so a hand-labeled bootstrap is the norm. For Gaussian models the initial
value of the parameters seems to be less important and so the initial means
and variances for Gaussians often are just set identically for all states by
using the mean and variances of the entire training set.
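
The initialization just described for Gaussian systems can be sketched as
follows: every allowed transition out of a state gets the same probability,
and every state starts with the global mean and variance of the training
frames. The state names, data layout, and function names are assumptions
made for illustration.

```python
def uniform_transitions(successors):
    """successors maps each state to its allowed next states.

    Returns a[i][j] with all allowed transitions out of i equiprobable.
    """
    return {i: {j: 1.0 / len(nxt) for j in nxt}
            for i, nxt in successors.items()}

def global_mean_and_variance(frames):
    """Per-feature mean and variance over every training frame."""
    n, dims = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dims)]
    return mean, var

def flat_start(states, frames):
    """Give every state an identical copy of the global mean and variance."""
    mean, var = global_mean_and_variance(frames)
    return {s: (list(mean), list(var)) for s in states}

# Hypothetical two-state left-to-right model and 1-feature frames.
a = uniform_transitions({"s1": ["s1", "s2"], "s2": ["s2"]})
b = flat_start(["s1", "s2"], [[1.0], [3.0]])
print(a)   # {'s1': {'s1': 0.5, 's2': 0.5}, 's2': {'s2': 1.0}}
print(b)   # {'s1': ([2.0], [1.0]), 's2': ([2.0], [1.0])}
```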

Now  we  have  initial  estimates  for  the  a and  b probabilities.  The  next 
stage of the algorithm differs for Gaussian and MLP systems. For MLP
systems we apply what is called a forced Viterbi alignment. A forced Viterbi
alignment takes as input the correct words in an utterance, along with the
spectral feature vectors. It produces the best sequence of HMM states, with
each state aligned with the feature vectors. A forced Viterbi is thus a
simplification of the regular Viterbi decoding algorithm, since it only has to figure
out the correct phone sequence, but doesn’t have to discover the word
sequence. It is called forced because we constrain the algorithm by requiring
the best path to go through a particular sequence of words. It still requires
the Viterbi algorithm since words have multiple pronunciations, and since
the duration of each phone is not fixed. The result of the forced Viterbi is a
set of feature vectors with ‘correct’ phone labels, which can then be used to
retrain the neural network. The counts of the transitions which are taken in
the forced alignments can be used to estimate the HMM transition
probabilities.
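
A minimal sketch of forced alignment, under simplifying assumptions that are not part of the description above: each word has a single pronunciation (so the phone sequence is already known), each phone is one HMM state with a self-loop, and transition probabilities are ignored during the search. The names forced_align and transition_counts, and the representation of the acoustic scores as one dictionary of log likelihoods per frame, are hypothetical.

```python
import math
from collections import Counter


def forced_align(loglik, phones):
    """Align frames to a known phone sequence.

    loglik[t][p] is the log likelihood of frame t given phone p.
    Returns one phone label per frame.
    """
    T, N = len(loglik), len(phones)
    if T < N:
        raise ValueError("need at least one frame per phone")
    best = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = loglik[0][phones[0]]  # the path must start in the first phone
    for t in range(1, T):
        for j in range(N):
            stay = best[t - 1][j]                            # self-loop
            move = best[t - 1][j - 1] if j > 0 else -math.inf  # advance one phone
            if move > stay:
                best[t][j], back[t][j] = move, j - 1
            else:
                best[t][j], back[t][j] = stay, j
            best[t][j] += loglik[t][phones[j]]
    j = N - 1  # the path must end in the last phone
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()
    return [phones[j] for j in path]


def transition_counts(alignment):
    """Count the transitions taken in an alignment (labels must be distinct per state)."""
    return Counter(zip(alignment, alignment[1:]))


if __name__ == "__main__":
    frames = [
        {"k": -1, "ae": -5, "t": -5},
        {"k": -1, "ae": -4, "t": -5},
        {"k": -5, "ae": -1, "t": -5},
        {"k": -5, "ae": -1, "t": -4},
        {"k": -5, "ae": -5, "t": -1},
    ]
    labels = forced_align(frames, ["k", "ae", "t"])
    print(labels)  # ['k', 'k', 'ae', 'ae', 't']
    print(transition_counts(labels))
```

The per-frame labels are what would be used as training targets for the network, and the transition counts, once normalized, give estimates of the transition probabilities.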

For the Gaussian HMMs, instead of using forced Viterbi, we use the
forward-backward algorithm described in Appendix D. We compute the
forward and backward probabilities for each sentence given the initial a and
b probabilities, and use them to re-estimate the a and b probabilities. Just
as for the MLP situation, the forward-backward algorithm needs to be
constrained by our knowledge of the correct words. The forward-backward
algorithm computes its probabilities given a model λ. We use the ‘known’
word sequence in a transcribed sentence to tell us which word models to
string together to get the model λ that we use to compute the forward and
backward probabilities for each sentence.


7.8  Waveform  Generation  for  Speech  Synthesis 

Now  that  we  have  covered  acoustic  processing  we  can  return  to  the  acoustic 
component  of  a text-to-speech  (TTS)  system.  Recall  from  Chapter  4 that  the 
output  of  the  linguistic  processing  component  of  a TTS  system  is  a sequence 
of phones, each with a duration, and an F0 contour which specifies the pitch.
This specification is often called the target, as it is this that we want the
synthesizer  to  produce. 

The most commonly used type of algorithm works by waveform
concatenation. Such concatenative synthesis is based on a database of speech
that  has  been  recorded  by  a single  speaker.  This  database  is  then  segmented 
into  a number  of  short  units,  which  can  be  phones,  diphones,  syllables,  words 
or  other  units.  The  simplest  sort  of  synthesizer  would  have  phone  units  and 
the  database  would  have  a single  unit  for  each  phone  in  the  phone  inventory. 
By  selecting  units  appropriately,  we  can  generate  a series  of  units  which 
match  the  phone  sequence  in  the  input.  By  using  signal  processing  to  smooth 
joins  at  the  unit  edges,  we  can  simply  concatenate  the  waveforms  for  each  of 
these  units  to  form  a single  synthetic  speech  waveform. 

Experience  has  shown  that  single  phone  concatenative  systems  don’t 
produce  good  quality  speech.  Just  as  in  speech  recognition,  the  context  of 
the phone plays an important role in its acoustic pattern and hence a /t/ before
an /a/ sounds very different from a /t/ before an /s/.

The triphone models described in Figure 7.11 on page 249 are a
popular choice of unit in speech recognition, because they cover both the left
and right contexts of a phone. Unfortunately, a language typically has a
very large number of triphones (tens of thousands) and it is currently
prohibitive to collect so many units for speech synthesis. Hence diphones are
often used in speech synthesis as they provide a reasonable balance between
context-dependency and size (typically 1000-2000 in a language). In speech
synthesis, diphone units normally start half-way through the first phone and
end half-way through the second. This is because it is known that phones are
more stable in the middle than at the edges, so that the middles of most /a/
phones in a diphone are reasonably similar, even if the acoustic patterns start
to differ substantially after that. If diphones are concatenated in the middles
of phones, the discontinuities between adjacent units are often negligible.
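
As a small illustration, the sketch below maps a phone sequence onto the diphone units that would be looked up in the database and compares upper bounds on the two inventory sizes. The function names, the "sil" boundary symbol, the "a-b" naming of units, and the 40-phone inventory are all assumptions made for the example; a real language uses only a subset of the possible combinations.

```python
def to_diphones(phones, boundary="sil"):
    """Return the diphone names covering a phone sequence, padded with silence."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def inventory_upper_bounds(num_phones):
    """Upper bounds on unit inventory size if every combination occurred."""
    return {"diphones": num_phones ** 2, "triphones": num_phones ** 3}


if __name__ == "__main__":
    print(to_diphones(["k", "ae", "t"]))
    print(inventory_upper_bounds(40))
```

For a hypothetical 40-phone inventory the bounds are 1,600 diphones against 64,000 triphones, which is in line with the orders of magnitude quoted above.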

Pitch  and  Duration  Modification 

The  diphone  synthesizer  as  just  described  will  produce  a reasonable  qual- 
ity speech  waveform  corresponding  to  the  requested  phone  sequence.  But 
the  pitch  and  duration  (i.e.  the  prosody)  of  each  phone  in  the  concatenated 
waveform  will  be  the  same  as  when  the  diphones  were  recorded  and  will  not 
correspond  to  the  pitch  and  durations  requested  in  the  input.  The  next  stage 
of  the  synthesis  process  therefore  is  to  use  signal  processing  techniques  to 
change  the  prosody  of  the  concatenated  waveform. 

The linear prediction (LPC) model described earlier can be used for
prosody modification as it explicitly separates the pitch of a signal from its
spectral envelope. If the concatenated waveform is represented by a sequence
of linear prediction coefficients, a set of pulses can be generated
corresponding to the desired pitch and used to re-excite the coefficients to produce a
speech waveform again. By contracting and expanding frames of
coefficients, the duration can be changed. While linear prediction produces the
correct F0 and durations, it produces a somewhat “buzzy” speech signal.

Another  technique  for  achieving  the  same  goal  is  the  time-domain 
pitch-synchronous  overlap  and  add  (TD-PSOLA)  technique.  TD-PSOLA 
works  pitch-synchronously  in  that  each  frame  is  centered  around  a pitch- 
mark  in  the  speech,  rather  than  at  regular  intervals  as  in  normal  speech  sig- 
nal processing.  The  concatenated  waveform  is  split  into  a number  of  frames, 
each  centered  around  a pitchmark  and  extending  a pitch  period  either  side. 
Prosody  is  changed  by  recombining  these  frames  at  a new  set  of  pitchmarks 
determined  by  the  requested  pitch  and  duration  of  the  input.  The  synthetic 
waveform  is  created  by  simply  overlapping  and  adding  the  frames.  Pitch  is 
increased by making the new pitchmarks closer together (shorter pitch
periods imply higher frequency pitch), and decreased by making them further
apart. Speech is made longer by duplicating frames and shorter by
leaving frames out. The operation of TD-PSOLA can be compared to that of a
tape recorder with variable speed: if you play back a tape faster than it was
recorded, the pitch periods will come closer together and hence the pitch
will increase. But speeding up a tape recording effectively increases the
frequency of all the components of the speech (including the formants which
characterize the vowels) and will give the impression of a “squeaky”,
unnatural voice. TD-PSOLA differs because it separates each frame first and then
decreases the distance between the frames. Because the internals of each
frame aren't changed, the frequency of the non-pitch components is hardly
altered, and the resultant speech sounds the same as the original except with
a different pitch.
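
The overlap-and-add procedure can be sketched in a few lines. The version below is a deliberately simplified illustration rather than a full TD-PSOLA implementation: it assumes a constant pitch period, pitchmarks that are already known, a raised-cosine window two periods wide, and a normalization by the summed window weights to keep the amplitude stable (a choice made here, not something specified above). The name td_psola and its parameters are hypothetical.

```python
import math


def td_psola(signal, marks, period, pitch_scale=1.0, time_scale=1.0):
    """Re-space pitch-synchronous frames to change pitch and/or duration.

    signal: list of samples; marks: pitchmark sample indices, `period` apart.
    """
    new_period = max(1, round(period / pitch_scale))
    out_len = round(len(signal) * time_scale)
    out = [0.0] * out_len
    norm = [0.0] * out_len
    # Raised-cosine window extending one pitch period either side of a mark.
    offsets = range(-period, period + 1)
    window = [0.5 * (1 + math.cos(math.pi * k / period)) for k in offsets]
    pos = round(marks[0] * time_scale)  # first synthesis pitchmark
    while pos < out_len:
        # Use the analysis frame closest in (rescaled) time; frames are
        # duplicated or skipped as a side effect of this choice.
        src_time = pos / time_scale
        m = min(marks, key=lambda a: abs(a - src_time))
        for k, w in zip(offsets, window):
            i, j = m + k, pos + k
            if 0 <= i < len(signal) and 0 <= j < out_len:
                out[j] += w * signal[i]
                norm[j] += w
        pos += new_period  # closer marks give a higher pitch
    return [o / n if n > 1e-12 else 0.0 for o, n in zip(out, norm)]


if __name__ == "__main__":
    period = 20
    pulses = [1.0 if i % period == 0 and i > 0 else 0.0 for i in range(160)]
    marks = list(range(period, 160, period))
    higher = td_psola(pulses, marks, period, pitch_scale=2.0)
    print([j for j, v in enumerate(higher) if v > 0.4])
```

In the pulse-train demo the output pulses come twice as often as the input pulses while the overall length is unchanged, and setting time_scale to 2.0 instead doubles the length by reusing each frame.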

Unit  Selection 

While  signal  processing  and  diphone  concatenation  can  produce  reasonable 
quality speech, the result is not ideal. There are a number of reasons for this,
but they all boil down to the fact that having a single example of each diphone
is not enough. First of all, signal processing inevitably incurs distortion,
and the quality of the speech gets worse when the signal processing has to
stretch the pitch and duration by large amounts. Furthermore, there are many
other subtle effects which are outside the scope of most signal processing
algorithms. For instance, the amount of vocal effort decreases over time as
the utterance is spoken, producing weaker speech at the end of the utterance.
If diphones are taken from near the start of an utterance, they will sound
unnatural in phrase-final positions.

Unit-selection  synthesis  is  an  attempt  to  address  this  problem  by  col- 
lecting several  examples  of  each  unit  at  different  pitches  and  durations  and 
linguistic  situations,  so  that  the  unit  is  close  to  the  target  in  the  first  place 
and  hence  the  signal  processing  needs  to  do  less  work.  One  technique  for 
unit-selection  (Hunt  and  Black,  1996)  works  as  follows: 

The input to the algorithm is the same as other concatenative
synthesizers, with the addition that the F0 contour is now specified as three F0
values per phone, rather than as a contour. The technique uses phones as
its units, indexing phones in a large database of naturally occurring speech.
Each phone in the database is also marked with a duration and three pitch
values. The algorithm works in two stages. First, for each phone in the target
word, a set of candidate units which match closely in terms of phone identity,
duration and F0 is selected from the database. These candidates are ranked
using a target cost function, which specifies just how close each unit
actually is to the target. The second part of the algorithm works by measuring
how well each candidate for each unit joins with its neighbors’ candidates.
Various locations for the joins are assessed, which allows the potential for
units to be joined in the middle, as with diphones. These potential joins are
ranked using a concatenation cost function. The final step is to pick the best
set of units which minimize the overall target and concatenation cost for the
whole sentence. This step is performed using the Viterbi algorithm in a
similar way to HMM speech recognition: here the target cost plays the role of
the observation probability and the concatenation cost plays the role of the
transition probability.
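
The search over candidates can be sketched as a minimum-cost Viterbi pass. The code below is an illustration under stated assumptions, not the cost functions of Hunt and Black (1996): each unit carries a single F0 value rather than three, the target cost is a plain sum of absolute F0 and duration differences, and the concatenation cost is zero for units that were adjacent in the recorded database and a fixed penalty plus the F0 mismatch otherwise. All names and the toy database are hypothetical.

```python
def target_cost(target, unit):
    """How far a database unit is from the requested F0 and duration."""
    return abs(target["f0"] - unit["f0"]) + abs(target["dur"] - unit["dur"])


def join_cost(left, right):
    """Zero if the units were adjacent in the recordings, else a mismatch penalty."""
    if left["utt"] == right["utt"] and left["idx"] + 1 == right["idx"]:
        return 0
    return 10 + abs(left["f0"] - right["f0"])


def select_units(targets, database):
    """Pick one unit per target minimizing total target plus join cost."""
    cands = [[u for u in database if u["phone"] == t["phone"]] for t in targets]
    best = [[target_cost(targets[0], u) for u in cands[0]]]
    back = [[None] * len(cands[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for u in cands[i]:
            costs = [best[i - 1][k] + join_cost(p, u)
                     for k, p in enumerate(cands[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(targets[i], u))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = []
    for i in range(len(targets) - 1, -1, -1):  # follow the backpointers
        path.append(cands[i][k])
        k = back[i][k]
    path.reverse()
    return path, total


if __name__ == "__main__":
    database = [
        {"phone": "k", "f0": 120, "dur": 60, "utt": 1, "idx": 0},
        {"phone": "ae", "f0": 150, "dur": 100, "utt": 1, "idx": 1},
        {"phone": "k", "f0": 125, "dur": 65, "utt": 2, "idx": 3},
        {"phone": "ae", "f0": 119, "dur": 100, "utt": 2, "idx": 4},
    ]
    targets = [{"phone": "k", "f0": 120, "dur": 60},
               {"phone": "ae", "f0": 118, "dur": 100}]
    units, cost = select_units(targets, database)
    print([(u["utt"], u["idx"]) for u in units], cost)
```

In this toy database the first /k/ matches its target exactly, yet the search prefers the second /k/ because it joins without cost to the /ae/ that best fits the target; this is the trade-off between target cost and concatenation cost that the Viterbi search resolves.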

By  using  a much  larger  database  which  contains  many  examples  of 
each  unit,  unit-selection  synthesis  often  produces  more  natural  speech  than 
straight  diphone  synthesis.  Some  systems  then  use  signal  processing  to  make 
sure  the  prosody  matches  the  target,  while  others  simply  concatenate  the 
units following the idea that an utterance which only roughly matches the
target  is  better  than  one  that  exactly  matches  it  but  also  has  some  signal 
processing  distortion. 


7.9  Human  Speech  Recognition 

Speech recognition in humans shares some features with the automatic speech
recognition models we have presented. We mentioned above that signal
processing algorithms like PLP analysis (Hermansky, 1990) were in fact
inspired by properties of the human auditory system. In addition, four
properties of human lexical access (the process of retrieving a word from the
mental lexicon) are also true of ASR models: frequency, parallelism,
neighborhood effects, and cue-based processing. For example, as in ASR with
its N-gram language models, human lexical access is sensitive to word
frequency. High-frequency spoken words are accessed faster or with less
information than low-frequency words. They are successfully recognized in
noisier environments than low-frequency words, or when only parts of the
words are presented (Howes, 1957; Grosjean, 1980; Tyler, 1984, inter alia).
Like ASR models, human lexical access is parallel: multiple words are
active at the same time (Marslen-Wilson and Welsh, 1978; Salasoo and Pisoni,
1985, inter alia). Human lexical access exhibits neighborhood effects (the
neighborhood of a word is the set of words which closely resemble it).
Words with large frequency-weighted neighborhoods are accessed more slowly
than words with fewer neighbors (Luce et al., 1990). Jurafsky (1996) shows
that the effect of neighborhood on access can be explained by the Bayesian
models used in ASR.

Finally,  human  speech  perception  is  cue-based:  speech  input  is  inter- 
preted by  integrating  cues  at  many  different  levels.  For  example,  there  is 
evidence  that  human  perception  of  individual  phones  is  based  on  the  inte- 
gration of  multiple  cues,  including  acoustic  cues,  such  as  formant  structure 
or the exact timing of voicing (Oden and Massaro, 1978; Miller, 1994),
visual cues, such as lip movement (Massaro and Cohen, 1983; Massaro, 1998),
and  lexical  cues  such  as  the  identity  of  the  word  in  which  the  phone  is  placed 
(Warren,  1970;  Samuel,  1981;  Connine  and  Clifton,  1987;  Connine,  1990). 
For  example,  in  what  is  often  called  the  phoneme  restoration  effect,  Warren 
(1970)  took  a speech  sample  and  replaced  one  phone  (e.g.  the  [s]  in  legisla- 
ture) with  a cough.  Warren  found  that  subjects  listening  to  the  resulting  tape 
typically  heard  the  entire  word  legislature  including  the  [s],  and  perceived 
the  cough  as  background.  Other  cues  in  human  speech  perception  include 
semantic word association (words are accessed more quickly if a semantically
related word has been heard recently) and repetition priming (words
are accessed more quickly if they themselves have just been heard). The
intuitions of both of these results are incorporated into recent language models
discussed in Chapter 6, such as the cache model of Kuhn and de Mori
(1990), which models repetition priming, or the trigger model of Rosenfeld
(1996) and the LSA models of Coccaro and Jurafsky (1998) and Bellegarda
(1999), which model word association. In a fascinating reminder that good
ideas are never discovered only once, Cole and Rudnicky (1983) point out
that  many  of  these  insights  about  context  effects  on  word  and  phone  pro- 
cessing were  actually  discovered  by  William  Bagley  (Bagley,  1901).  Bagley 
achieved  his  results,  including  an  early  version  of  the  phoneme  restoration 
effect,  by  recording  speech  on  Edison  phonograph  cylinders,  modifying  it, 
and  presenting  it  to  subjects.  Bagley’s  results  were  forgotten  and  only  redis- 
covered much  later. 

One difference between current ASR models and human speech
recognition is the time-course of the model. It is important for the performance of
the ASR algorithm that the decoding search optimizes over the entire
utterance. This means that the best sentence hypothesis returned by a decoder
at the end of the sentence may be very different from the current-best
hypothesis halfway into the sentence. By contrast, there is extensive evidence
that human processing is on-line: people incrementally segment an
utterance into words and assign it an interpretation as they hear it. For example,
Marslen-Wilson (1973) studied close shadowers: people who are able to
shadow (repeat back) a passage as they hear it with lags as short as 250 ms.
Marslen-Wilson found that when these shadowers made errors, they were
syntactically and semantically appropriate with the context, indicating that
word segmentation, parsing, and interpretation took place within these 250
ms. Cole (1973) and Cole and Jakimik (1980) found similar effects in their
work on the detection of mispronunciations. These results have led
psychological models of human speech perception (such as the Cohort model
(Marslen-Wilson and Welsh, 1978) and the computational TRACE model
(McClelland and Elman, 1986)) to focus on the time-course of word
selection and segmentation. The TRACE model, for example, is a connectionist
or neural network interactive-activation model, based on independent
computational units organized into three levels: feature, phoneme, and word.
Each unit represents a hypothesis about its presence in the input. Units are
activated in parallel by the input, and activation flows between units;
connections between units on different levels are excitatory, while connections
between units on a single level are inhibitory. Thus the activation of a word
slightly inhibits all other words.

We have focused on the similarities between human and machine speech
recognition; there are also many differences. In particular, many other cues
have been shown to play a role in human speech recognition but have
yet to be successfully integrated into ASR. The most important class of these
missing cues is prosody. To give only one example, Cutler and Norris (1988)
and Cutler and Carter (1987) note that most multisyllabic English word tokens
have stress on the initial syllable, suggesting in their metrical segmentation
strategy (MSS) that stress should be used as a cue for word segmentation.


7.10  Summary 

Together with Chapters 4, 5, and 6, this chapter introduced the fundamental
algorithms  for  addressing  the  problem  of  Large  Vocabulary  Continuous 
Speech  Recognition  and  Text-To-Speech  synthesis. 

• The  input  to  a speech  recognizer  is  a series  of  acoustic  waves.  The 
waveform,  spectrogram  and  spectrum  are  among  the  visualization 
tools  used  to  understand  the  information  in  the  signal. 

• In the first step in speech recognition, sound waves are sampled,
quantized, and converted to some sort of spectral representation. A
commonly used spectral representation is the LPC cepstrum, which
provides a vector of features for each time-slice of the input.

• These  feature  vectors  are  used  to  estimate  the  phonetic  likelihoods 
(also  called  observation  likelihoods)  either  by  a mixture  of  Gaussian 
estimators  or  by  a neural  net. 

• Decoding or search is the process of finding the optimal sequence of
model states which matches a sequence of input observations. (The
fact that there are two terms for this process is a hint that speech
recognition is inherently inter-disciplinary, and draws its metaphors from more
than one field; decoding comes from information theory, and search
from artificial intelligence).

• We  introduced  two  decoding  algorithms:  time-synchronous  Viterbi 
decoding  (which  is  usually  implemented  with  pruning  and  can  then  be 
called  beam  search)  and  stack  or  A*  decoding.  Both  algorithms  take 
as input a series of feature vectors, and two ancillary algorithms: one for
assigning likelihoods (e.g. Gaussians or MLP) and one for assigning
priors (e.g. an N-gram language model). Both give as output a string
of  words. 

• The  embedded  training  paradigm  is  the  normal  method  for  training 
speech  recognizers.  Given  an  initial  lexicon  with  hand-built  pronunci- 
ation structures,  it  will  train  the  HMM  transition  probabilities  and  the 
HMM  observation  probabilities.  This  HMM  observation  probability 
estimation can be done via a Gaussian or an MLP.

• One  way  to  implement  the  acoustic  component  of  a TTS  system  is  with 
concatenative  synthesis,  in  which  an  utterance  is  built  by  concatenat- 
ing and  then  smoothing  diphones  taken  from  a large  database  of  speech 
recorded  by  a single  speaker. 


Bibliographical  and  Historical  Notes 

The  first  machine  which  recognized  speech  was  probably  a commercial  toy 
named “Radio Rex” which was sold in the 1920s. Rex was a celluloid dog
which  moved  (via  a spring)  when  the  spring  was  released  by  500  Hz  acoustic 
energy.  Since  500  Hz  is  roughly  the  first  formant  of  the  vowel  in  “Rex”,  the 
dog  seemed  to  come  when  he  was  called  (David  and  Selfridge,  1962). 

By the late 1940s and early 1950s, a number of machine speech
recognition systems had been built. An early Bell Labs system could
recognize any of the 10 digits from a single speaker (Davis et al., 1952). This
system  had  10  speaker-dependent  stored  patterns,  one  for  each  digit,  each  of 
which  roughly  represented  the  first  two  vowel  formants  in  the  digit.  They 
achieved  97-99%  accuracy  by  choosing  the  pattern  which  had  the  highest 
relative  correlation  coefficient  with  the  input.  Fry  (1959)  and  Denes  (1959) 
built  a phoneme  recognizer  at  University  College,  London,  which  recognized 
four  vowels  and  nine  consonants  based  on  a similar  pattern-recognition  prin- 
ciple. Fry  and  Denes’s  system  was  the  first  to  use  phoneme  transition  prob- 
abilities to  constrain  the  recognizer. 

The late 1960s and early 1970s produced a number of important
paradigm shifts. First were a number of feature-extraction algorithms, including
the efficient Fast Fourier Transform (FFT) (Cooley and Tukey, 1965), the
application of cepstral processing to speech (Oppenheim et al., 1968), and
the development of LPC for speech coding (Atal and Hanauer, 1971).
Second were a number of ways of handling warping: stretching or shrinking
the  input  signal  to  handle  differences  in  speaking  rate  and  segment  length 
when  matching  against  stored  patterns.  The  natural  algorithm  for  solving 
this  problem  was  dynamic  programming,  and,  as  we  saw  in  Chapter  5,  the 
algorithm  was  reinvented  multiple  times  to  address  this  problem.  The  first 
application  to  speech  processing  was  by  Vintsyuk  (1968),  although  his  re- 
sult was  not  picked  up  by  other  researchers,  and  was  reinvented  by  Velichko 
and  Zagoruyko  (1970)  and  Sakoe  and  Chiba  (1971)  (and  (1984)).  Soon  af- 
terwards, Itakura  (1975)  combined  this  dynamic  programming  idea  with  the 
LPC  coefficients  that  had  previously  been  used  only  for  speech  coding.  The 
resulting  system  extracted  LPC  features  for  incoming  words  and  used  dy- 
namic programming  to  match  them  against  stored  LPC  templates. 

The  third  innovation  of  this  period  was  the  rise  of  the  HMM.  Hid- 
den Markov  Models  seem  to  have  been  applied  to  speech  independently 
at  two  laboratories  around  1972.  One  application  arose  from  the  work  of 
statisticians,  in  particular  Baum  and  colleagues  at  the  Institute  for  Defense 
Analyses  in  Princeton  on  HMMs  and  their  application  to  various  predic- 
tion problems  (Baum  and  Petrie,  1966;  Baum  and  Eagon,  1967).  James 
Baker  learned  of  this  work  and  applied  the  algorithm  to  speech  process- 
ing (Baker,  1975)  during  his  graduate  work  at  CMU.  Independently,  Freder- 
ick Jelinek,  Robert  Mercer,  and  Lalit  Bahl  (drawing  from  their  research  in 
information-theoretical  models  influenced  by  the  work  of  Shannon  (1948)) 
applied  HMMs  to  speech  at  the  IBM  Thomas  J.  Watson  Research  Center 
(Jelinek et al., 1975). IBM’s and Baker’s systems were very similar,
particularly in their use of the Bayesian framework described in this chapter.
One early difference was the decoding algorithm; Baker’s DRAGON system
used Viterbi (dynamic programming) decoding, while the IBM system
applied Jelinek’s stack decoding algorithm (Jelinek, 1969). Baker then joined
the  IBM  group  for  a brief  time  before  founding  the  speech-recognition  com- 
pany Dragon  Systems.  The  HMM  approach  to  speech  recognition  would 
turn  out  to  completely  dominate  the  field  by  the  end  of  the  century;  indeed 
the  IBM  lab  was  the  driving  force  in  extending  statistical  models  to  natural 
language  processing  as  well,  including  the  development  of  class-based  N- 
grams,  HMM-based  part-of-speech  tagging,  statistical  machine  translation, 
and  the  use  of  entropy/perplexity  as  an  evaluation  metric. 

The  use  of  the  HMM  slowly  spread  through  the  speech  community. 
One  cause  was  a number  of  research  and  development  programs  sponsored 
by the Advanced Research Projects Agency of the U.S. Department of
Defense (ARPA). The first five-year program started in 1971, and is reviewed
in Klatt (1977). The goal of this first program was to build speech
understanding systems based on a few speakers, a constrained grammar and
lexicon (1000 words), and less than 10% semantic error rate. Four systems were
funded and compared against each other: the System Development
Corporation (SDC) system, Bolt, Beranek & Newman (BBN)’s HWIM system,
Carnegie-Mellon  University’s  Hearsay-II  system,  and  Carnegie-Mellon’s  Harpy 
system  (Lowerre,  1968).  The  Harpy  system  used  a simplified  version  of 
Baker’s  HMM-based  DRAGON  system  and  was  the  best  of  the  tested  sys- 
tems, and  according  to  Klatt  the  only  one  to  meet  the  original  goals  of  the 
ARPA project (with a semantic accuracy rate of 94% on a simple task).

Beginning in the mid-1980s, ARPA funded a number of new speech
research programs. The first was the “Resource Management” (RM) task
(Price et al., 1988), which like the earlier ARPA task involved
transcription (recognition) of read-speech (speakers reading sentences constructed
from a 1000-word vocabulary) but which now included a component that
involved speaker-independent recognition. Later tasks included recognition
of sentences read from the Wall Street Journal (WSJ) beginning with limited
systems of 5,000 words, and finally with systems of unlimited vocabulary
(in practice most systems use approximately 60,000 words). Later
speech-recognition tasks moved away from read-speech to more natural domains:
the Broadcast News (also called Hub-4) domain (LDC, 1998; Graff, 1997)
(transcription of actual news broadcasts, including quite difficult passages
such as on-the-street interviews) and the CALLHOME and CALLFRIEND
domain (LDC, 1999) (natural telephone conversations between friends), part
of what was also called Hub-5. The Air Travel Information System (ATIS)
task (Hemphill et al., 1990) was a speech understanding task whose goal
was to simulate helping a user book a flight, by answering questions about
potential airlines, times, dates, etc.

Each of the ARPA tasks involved an approximately annual bake-off at
which all ARPA-funded systems, and many other ‘volunteer’ systems from
North America and Europe, were evaluated against each other in terms of
word error rate or semantic error rate. In the early evaluations, for-profit
corporations did not generally compete, but eventually many (especially IBM
and AT&T) competed regularly. The ARPA competitions resulted in widescale
borrowing of techniques among labs, since it was easy to see which ideas
had provided an error-reduction the previous year, and were probably an
important factor in the eventual spread of the HMM paradigm to virtually every
major speech recognition lab. The ARPA program also resulted in a number
of  useful  databases,  originally  designed  for  training  and  testing  systems  for 
each  evaluation  (TIMIT,  RM,  WSJ,  ATIS,  BN,  CALLHOME,  Switchboard) 
but  then  made  available  for  general  research  use. 

There are a number of textbooks on speech recognition that are good
choices for readers who seek a more in-depth understanding of the material
in this chapter: Jelinek (1997), Gold and Morgan (1999), and Rabiner and
Juang (1993) are the most comprehensive. The last two textbooks also have
comprehensive discussions of the history of the field, and together with the
survey paper of Levinson (1995) have influenced our short history discussion
in this chapter. Our description of the forward-backward algorithm was
modeled after Rabiner (1989). Another useful tutorial paper is Knill and Young
(1997). Research in the speech recognition field often appears in the
proceedings of the biennial EUROSPEECH Conference and the International
Conference on Spoken Language Processing (ICSLP), held in alternating
years, as well as the annual IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP). Journals include Speech
Communication, Computer Speech and Language, IEEE Transactions on Pattern
Analysis and Machine Intelligence, and IEEE Transactions on Acoustics,
Speech, and Signal Processing.


Exercises 

7.1  Analyze  each  of  the  errors  in  the  incorrectly  recognized  transcription 
of “um the phone is I left the...” on page 269. For each one, give your best
guess as to whether you think it is caused by a problem in signal
processing, pronunciation modeling, lexicon size, language model, or pruning in the
decoding  search. 

7.2  In practice, speech recognizers do all their probability computation
using the log probability (or logprob) rather than actual probabilities. This
helps avoid underflow for very small probabilities, but also makes the Viterbi
algorithm  very  efficient,  since  all  probability  multiplications  can  be  imple- 
mented by  adding  log  probabilities.  Rewrite  the  pseudocode  for  the  Viterbi 
algorithm  in  Figure  7.9  on  page  247  to  make  use  of  logprobs  instead  of  prob- 
abilities. 

7.3  Now  modify  the  Viterbi  algorithm  in  Figure  7.9  on  page  247  to  im- 
plement the  beam  search  described  on  page  249.  Hint:  You  will  probably 
need  to  add  in  code  to  check  whether  a given  state  is  at  the  end  of  a word  or 
not. 

7.4  Finally,  modify  the  Viterbi  algorithm  in  Figure  7.9  on  page  247  with 
more  detailed  pseudocode  implementing  the  array  of  backtrace  pointers. 

7.5  Implement the Stack decoding algorithm of Figure 7.14 on page 254. Pick
a very  simple  h*  function  like  an  estimate  of  the  number  of  words  remaining 
in  the  sentence. 

7.6  Modify  the  forward  algorithm  of  Figure  5.16  to  use  the  tree-structured 
lexicon  of  Figure  7.18  on  page  257. 


Part  II 

SYNTAX 


If words are the foundation of speech and language processing, syntax
is the skeleton. Syntax is the study of formal relationships between
words. These six chapters study how words are clustered into classes
called parts-of-speech, how they group with their neighbors into
phrases, and the way words depend on other words in a sentence. The
section explores computational models of all of these kinds of
knowledge, including context-free grammars, lexicalized grammars,
feature structures, and metatheoretical issues like the Chomsky
hierarchy. It introduces fundamental algorithms for dealing with this
knowledge, like the Earley and CYK algorithms for parsing and the
unification algorithm for feature combination. It also includes
probabilistic models of this syntactic knowledge, including HMM
part-of-speech taggers and probabilistic context-free grammars.
Finally, this section will explore psychological models of human
syntactic processing.


WORD CLASSES AND PART-OF-SPEECH TAGGING


Conjunction Junction, what’s your function?

Bob  Dorough,  Schoolhouse  Rock,  1973 

There  are  ten  parts  of  speech,  and  they  are  all  troublesome. 
Mark  Twain,  The  Awful  German  Language 


The  definitions  [of  the  parts  of  speech]  are  very  far  from  having 
attained  the  degree  of  exactitude  found  in  Euclidean  geometry. 
Otto  Jespersen,  The  Philosophy  of  Grammar,  1924 


Words are traditionally grouped into equivalence classes called parts of
speech (POS; Latin pars orationis), word classes, morphological classes,
or lexical tags. In traditional grammars there were generally only a few
parts of speech (noun, verb, adjective, preposition, adverb, conjunction,
etc.). More recent models have much larger numbers of word classes (45 for
the Penn Treebank (Marcus et al., 1993), 87 for the Brown corpus (Francis,
1979; Francis and Kucera, 1982), and 146 for the C7 tagset (Garside et al.,
1997)).

The part of speech for a word gives a significant amount of information
about the word and its neighbors. This is clearly true for major categories
(verb versus noun), but is also true for the many finer distinctions. For
example, these tagsets distinguish between possessive pronouns (my, your,
his, her, its) and personal pronouns (I, you, he, me). Knowing whether a
word is a possessive pronoun or a personal pronoun can tell us what words
are likely to occur in its vicinity (possessive pronouns are likely to be
followed by a noun, personal pronouns by a verb). This can be useful in a
language model for speech recognition.

A word’s part-of-speech can tell us something about how the word is
pronounced. As Chapter 4 discussed, the word content, for example, can be
a noun or an adjective. They are pronounced differently (the noun is
pronounced CONtent and the adjective conTENT). Thus knowing the part of
speech can produce more natural pronunciations in a speech synthesis
system and more accuracy in a speech recognition system. (Other pairs like
this include OBject (noun) and obJECT (verb), and DIScount (noun) and
disCOUNT (verb); see Cutler (1986).)
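
To make the synthesis point concrete, here is a minimal Python sketch of a
pronunciation lookup keyed on both the word and its part of speech. The
table `STRESS` and the function `pronounce` are hypothetical names introduced
for this illustration; the entries are just the three pairs mentioned above,
with capitals marking the stressed syllable.

```python
# Toy table: (word, coarse part of speech) -> stress pattern.
STRESS = {
    ("content", "noun"): "CONtent",
    ("content", "adjective"): "conTENT",
    ("object", "noun"): "OBject",
    ("object", "verb"): "obJECT",
    ("discount", "noun"): "DIScount",
    ("discount", "verb"): "disCOUNT",
}

def pronounce(word, pos):
    """Return the stress pattern for `word` used as `pos`.

    Falls back to the unmarked word when the pair is not in the table.
    """
    return STRESS.get((word.lower(), pos), word)

print(pronounce("content", "noun"))       # CONtent
print(pronounce("content", "adjective"))  # conTENT
```

A lookup keyed on the word alone could not choose between the two patterns;
the part-of-speech tag is what selects the right one.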

Parts of speech can also be used in stemming for information retrieval
(IR), since knowing a word’s part of speech can help tell us which
morphological affixes it can take, as we saw in Chapter 3. They can also
help an IR application by helping select out nouns or other important words
from a document. Automatic part-of-speech taggers can help in building
automatic word-sense disambiguating algorithms, and POS taggers are also
used in advanced ASR language models such as class-based N-grams, discussed
in Section 8.7. Parts of speech are very often used for ‘partial parsing’
texts, for example for quickly finding names or other phrases for the
information extraction applications discussed in Chapter 15. Finally,
corpora that have been marked for part-of-speech are very useful for
linguistic research, for example to help find instances or frequencies of
particular constructions in large corpora.

The remainder of this chapter begins in Section 8.1 with a summary of
English word classes, followed by a description in Section 8.2 of different
tagsets for formally coding these classes. The next three sections then
introduce three tagging algorithms: rule-based tagging, stochastic tagging,
and transformation-based tagging.


8.1  (Mostly)  English  Word  Classes 

Well, every person you can know,
And every place that you can go,
And anything that you can show,
You know they’re nouns.

Lynn  Ahrens,  Schoolhouse  Rock,  1973 


Until now we have been using part-of-speech terms like noun and verb
rather freely. In this section we give a more complete definition of these
and other classes. Traditionally the definition of parts of speech has been
based on morphological and syntactic function; words that function
similarly with respect to the affixes they take (their morphological
properties) or with respect to what can occur nearby (their ‘distributional
properties’) are grouped into classes. While word classes do have tendencies
toward semantic coherence (nouns do in fact often describe ‘people, places
or things’, and adjectives often describe properties), this is not
necessarily the case, and in general we don’t use semantic coherence as a
definitional criterion for parts of speech.

Parts of speech can be divided into two broad supercategories: closed
class types and open class types. Closed classes are those that have
relatively fixed membership. For example, prepositions are a closed class
because there is a fixed set of them in English; new prepositions are rarely
coined. By contrast nouns and verbs are open classes because new nouns and
verbs are continually coined or borrowed from other languages (e.g. the new
verb to fax or the borrowed noun futon). It is likely that any given speaker
or corpus will have different open class words, but all speakers of a
language, and corpora that are large enough, will likely share the set of
closed class words. Closed class words are generally also function words;
function words are grammatical words like of, it, and, or you, which tend to
be very short, occur frequently, and play an important role in grammar.

There are four major open classes that occur in the languages of the
world: nouns, verbs, adjectives, and adverbs. It turns out that English has
all four of these, although not every language does. Many languages have no
adjectives. In the native American language Lakhota, for example, and also
possibly in Chinese, the words corresponding to English adjectives act as a
subclass of verbs.

Every known human language has at least the two categories noun and
verb (although in some languages, for example Nootka, the distinction is
subtle). Noun is the name given to the lexical class in which the words for
most people, places, or things occur. But since lexical classes like noun
are defined functionally (morphologically and syntactically) rather than
semantically, some words for people, places, and things may not be nouns,
and conversely some nouns may not be words for people, places, or things.
Thus nouns include concrete terms like ship and chair, abstractions like
bandwidth and relationship, and verb-like terms like pacing (as in His
pacing to and fro became quite annoying). What defines a noun in English,
then, are things like its ability to occur with determiners (a goat, its
bandwidth, Plato’s Republic), to take possessives (IBM’s annual revenue),
and for most but not all nouns, to occur in the plural form (goats, abaci).

Nouns are traditionally grouped into proper nouns and common nouns.
Proper nouns, like Regina, Colorado, and IBM, are names of specific persons
or entities. In English, they generally aren't preceded by articles (e.g.
the book is upstairs, but Regina is upstairs). In written English, proper
nouns are usually capitalized.

In many languages, including English, common nouns are divided into
count nouns and mass nouns. Count nouns are those that allow grammatical
enumeration; that is, they can occur in both the singular and plural
(goat/goats, relationship/relationships) and they can be counted (one goat,
two goats). Mass nouns are used when something is conceptualized as a
homogeneous group. So words like snow, salt, and communism are not counted
(i.e. *two snows or *two communisms). Mass nouns can also appear without
articles where singular count nouns cannot (Snow is white but not *Goat is
white).

The verb class includes most of the words referring to actions and
processes, including main verbs like draw, provide, differ, and go. As we
saw in Chapter 3, English verbs have a number of morphological forms
(non-3rd-person-sg (eat), 3rd-person-sg (eats), progressive (eating), past
participle (eaten)). A subclass of English verbs called auxiliaries will be
discussed when we turn to closed class forms.

The third open class English form is adjectives; semantically this class
includes many terms that describe properties or qualities. Most languages
have adjectives for the concepts of color (white, black), age (old, young),
and value (good, bad), but there are languages without adjectives. As we
discussed above, many linguists argue that the Chinese family of languages
uses verbs to describe such English-adjectival notions as color and age.

The final open class form, adverbs, is rather a hodge-podge, both
semantically and formally. For example Schachter (1985) points out that in a
sentence like the following, all the italicized words are adverbs:

Unfortunately, John walked home extremely slowly yesterday

What coherence the class has semantically may be solely that each of
these words can be viewed as modifying something (often verbs, hence the
name ‘adverb’, but also other adverbs and entire verb phrases). Directional
adverbs or locative adverbs (home, here, downhill) specify the direction
or location of some action; degree adverbs (extremely, very, somewhat)
specify the extent of some action, process, or property; manner adverbs
(slowly, slinkily, delicately) describe the manner of some action or process;
and temporal adverbs describe the time that some action or event took place
(yesterday, Monday). Because of the heterogeneous nature of this class,
some adverbs (for example temporal adverbs like Monday) are tagged in
some tagging schemes as nouns.

The  closed  classes  differ  more  from  language  to  language  than  do  the 
open  classes.  Here’s  a quick  overview  of  some  of  the  more  important  closed 
classes  in  English,  with  a few  examples  of  each: 

• prepositions:  on,  under,  over,  near,  by,  at,  from,  to,  with 

• determiners:  a,  an,  the 

• pronouns:  she,  who,  I,  others 

• conjunctions:  and,  but,  or,  as,  if,  when 

• auxiliary  verbs:  can,  may,  should,  are 

• particles: up, down, on, off, in, out, at, by

• numerals:  one,  two,  three,  first,  second,  third 

Prepositions occur before noun phrases; semantically they are
relational, often indicating spatial or temporal relations, whether literal
(on it, before then, by the house) or metaphorical (on time, with gusto,
beside herself). But they often indicate other relations as well (Hamlet was
written by Shakespeare, and (from Shakespeare) “And I did laugh sans
intermission an hour by his dial”). Figure 8.1 shows the prepositions of
English according to the CELEX on-line dictionary (Celex, 1993), sorted by
their frequency in the COBUILD 16 million word corpus of English (?). Note
that this should not be considered a definitive list. Different dictionaries
and different tag sets may label word classes differently. This list
combines prepositions and particles; see below for more on particles.

A particle  is  a word  that  resembles  a preposition  or  an  adverb,  and  that 
often  combines  with  a verb  to  form  a larger  unit  called  a phrasal  verb,  as  in 
the  following  examples  from  Thoreau: 

So  I went  on  for  some  days  cutting  and  hewing  timber. . . 

Moral  reform  is  the  effort  to  throw  off  sleep. . . 

We can see that these are particles rather than prepositions, for in the
first example, on is followed, not by a noun phrase, but by a true preposition
phrase. With transitive phrasal verbs, as in the second example, we can tell
that off is a particle and not a preposition because particles may appear
after their objects (throw sleep off as well as throw off sleep). This is not
possible for prepositions (The horse went off its track, but *The horse went
its track off).

of     540,085    through  14,964    worth       1,563    pace   12
in     331,235    after    13,670    toward      1,390    nigh    9
for    142,421    between  13,275    plus          750    re      4
to     125,691    under     9,525    till          686    mid     3
with   124,965    per       6,515    amongst       525    o’er    2
on     109,129    among     5,090    via           351    but     0
at     100,169    within    5,030    amid          222    ere     0
by      77,794    towards   4,700    underneath    164    less    0
from    74,843    above     3,056    versus        113    midst   0
about   38,428    near      2,026    amidst         67    o’      0
than    20,210    off       1,695    sans           20    thru    0
over    18,071    past      1,575    circa          14    vice    0

Figure 8.1  Prepositions (and particles) of English from the CELEX on-line
dictionary. Frequency counts are from the COBUILD 16 million word corpus.


Quirk et al. (1985a) gives the following list of single-word particles.
Since it is extremely hard to automatically distinguish particles from
prepositions, some tag sets (like the one used for CELEX) do not distinguish
them, and even in corpora that do (like the Penn Treebank) the distinction is
very difficult to make reliably in an automatic process, so we do not give
counts.


aboard     aside    besides            forward(s)  opposite  through
about      astray   between            home        out       throughout
above      away     beyond             in          outside   together
across     back     by                 inside      over      under
ahead      before   close              instead     overhead  underneath
alongside  behind   down               near        past      up
apart      below    east, etc.         off         round     within
around     beneath  eastward(s), etc.  on          since     without

Figure 8.2  English single-word particles from Quirk et al. (1985a)


A particularly small closed class is the articles: English has three: a,
an, and the (although this (as in this chapter) and that (as in that page) are
often included as well). Articles often begin a noun phrase. A and an mark a
noun phrase as indefinite, while the can mark it as definite. We will discuss
definiteness in Chapter 18. Articles are quite frequent in English; indeed
the is the most frequent word in most English corpora. Here are COBUILD
statistics, again out of 16 million words:

the  1,071,676
a      413,887
an      59,359
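
To put these counts in proportion, the short Python sketch below converts
them to shares of the corpus. It assumes a corpus size of exactly 16,000,000
words, which is only an approximation of the "16 million" quoted above; the
names `CORPUS_SIZE` and `ARTICLE_COUNTS` are introduced for this illustration.

```python
# Counts quoted in the text; the corpus size is an assumed round figure.
CORPUS_SIZE = 16_000_000
ARTICLE_COUNTS = {"the": 1_071_676, "a": 413_887, "an": 59_359}

# Share of all corpus tokens accounted for by each article.
shares = {word: count / CORPUS_SIZE for word, count in ARTICLE_COUNTS.items()}
total_share = sum(ARTICLE_COUNTS.values()) / CORPUS_SIZE

for word, share in shares.items():
    print(f"{word:>3}: {share:.1%}")
print(f"all: {total_share:.1%}")
```

Under that assumption the alone is about 6.7% of the tokens, and the three
articles together come to roughly one token in ten.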

Conjunctions are used to join two phrases, clauses, or sentences.
Coordinating conjunctions like and, or, or but, join two elements of equal
status. Subordinating conjunctions are used when one of the elements is of
some sort of embedded status. For example that in ‘I thought that you might
like some milk’ is a subordinating conjunction that links the main clause I
thought with the subordinate clause you might like some milk. This clause
is called subordinate because this entire clause is the ‘content’ of the main
verb thought. Subordinating conjunctions like that which link a verb to its
argument in this way are also called complementizers. Chapter 9 and
Chapter 11 will discuss complementation in more detail. Figure 8.3 lists
English conjunctions.


and       514,946   yet        5,040   considering      174   forasmuch as    0
that      134,773   since      4,843   lest             131   however         0
but        96,889   where      3,952   albeit           104   immediately     0
or         76,563   nor        3,078   providing         96   in as far as    0
as         54,608   once       2,826   whereupon         85   in so far as    0
if         53,917   unless     2,205   seeing            63   inasmuch as     0
when       37,975   why        1,333   directly          26   insomuch as     0
because    23,626   now        1,290   ere               12   insomuch that   0
so         12,933   neither    1,120   notwithstanding    3   like            0
before     10,720   whenever     913   according as       0   neither nor     0
though     10,329   whereas      867   as if              0   now that        0
than        9,511   except       864   as long as         0   only            0
while       8,144   till         686   as though          0   provided that   0
after       7,042   provided     594   both and           0   providing that  0
whether     5,978   whilst       351   but that           0   seeing as       0
for         5,935   suppose      281   but then           0   seeing as how   0
although    5,424   cos          188   but then again     0   seeing that     0
until       5,072   supposing    185   either or          0   without         0

Figure 8.3  Coordinating and subordinating conjunctions of English from
the CELEX on-line dictionary. Frequency counts are from the COBUILD 16
million word corpus.


Pronouns are forms that often act as a kind of shorthand for referring
to some noun phrase or entity or event. Personal pronouns refer to persons
or entities (you, she, I, it, me, etc). Possessive pronouns are forms of
personal pronouns that indicate either actual possession or more often just
an abstract relation between the person and some object (my, your, his, her,
its, one’s, our, their). Wh-pronouns (what, who, whom, whoever) are used
in certain question forms, or may also act as complementizers (Frieda, who
I met five years ago...). Figure 8.4 shows English pronouns, again from
CELEX.


it     199,920   how         13,137   yourself    2,437   no one         106
I      198,139   another     12,551   why         2,220   wherein         58
he     158,366   where       11,857   little      2,089   double          39
you    128,688   same        11,841   none        1,992   thine           30
his     99,820   something   11,754   nobody      1,684   summat          22
they    88,416   each        11,320   further     1,666   suchlike        18
this    84,927   both        10,930   everybody   1,474   fewest          15
that    82,603   last        10,816   ourselves   1,428   thyself         14
she     73,966   every        9,788   mine        1,426   whomever        11
her     69,004   himself      9,113   somebody    1,322   whosoever       10
we      64,846   nothing      9,026   former      1,177   whomsoever       8
all     61,767   when         8,336   past          984   wherefore        6
which   61,399   one          7,423   plenty        940   whereat          5
their   51,922   much         7,237   either        848   whatsoever       4
what    50,116   anything     6,937   yours         826   whereon          2
my      46,791   next         6,047   neither       618   whoso            2
him     45,024   themselves   5,990   fewer         536   aught            1
me      43,071   most         5,115   hers          482   howsoever        1
who     42,881   itself       5,032   ours          458   thrice           1
them    42,099   myself       4,819   whoever       391   wheresoever      1
no      33,458   everything   4,662   least         386   you-all          1
some    32,863   several      4,306   twice         382   additional       0
other   29,391   less         4,278   theirs        303   anybody          0
your    28,923   herself      4,016   wherever      289   each other       0
its     27,783   whose        4,005   oneself       239   once             0
our     23,029   someone      3,755   thou          229   one another      0
these   22,697   certain      3,345   ’un           227   overmuch         0
any     22,666   anyone       3,318   ye            192   such and such    0
more    21,873   whom         3,229   thy           191   whate’er         0
many    17,343   enough       3,197   whereby       176   whenever         0
such    16,880   half         3,065   thee          166   whereof          0
those   15,819   few          2,933   yourselves    148   whereto          0
own     15,741   everyone     2,812   latter        142   whereunto        0
us      15,724   whatever     2,571   whichever     121   whichsoever      0

Figure 8.4  Pronouns of English from the CELEX on-line dictionary.
Frequency counts are from the COBUILD 16 million word corpus.

A closed class subtype of English verbs are the auxiliary verbs.
Cross-linguistically, auxiliaries are words (usually verbs) that mark certain
semantic features of a main verb, including whether an action takes place in
the present, past or future (tense), whether it is completed (aspect), whether
it is negated (polarity), and whether an action is necessary, possible,
suggested, desired, etc. (mood).

English auxiliaries include the copula verb be, the two verbs do and
have, along with their inflected forms, as well as a class of modal verbs. Be
is called a copula because it connects subjects with certain kinds of predicate
nominals and adjectives (He is a duck). The verb have is used for example
to mark the perfect tenses (I have gone, I had gone), while be is used as part
of the passive (We were robbed), or progressive (We are leaving)
constructions. The modals are used to mark the mood associated with the
event or action depicted by the main verb. So can indicates ability or
possibility, may indicates permission or possibility, must indicates
necessity, etc. Figure 8.5 gives counts for the frequencies of the modals in
English. In addition to the auxiliary have mentioned above, there is a modal
verb have (e.g. I have to go), which is very common in spoken English.
Neither it nor the modal verb dare, which is very rare, have frequency counts
because the CELEX dictionary does not distinguish the main verb sense (I have
three oranges, He dared me to eat them), from the modal sense (There has to
be some mistake, Dare I confront him?) from the non-modal auxiliary verb
sense (I have never seen that).


can     70,930   might       5,580   shouldn’t     858
will    69,206   couldn’t    4,265   mustn’t       332
may     25,802   shall       4,118   ’ll           175
would   18,448   wouldn’t    3,548   needn’t       148
should  17,760   won’t       3,100   mightn’t       68
must    16,520   ’d          2,299   oughtn’t       44
need     9,955   ought       1,845   mayn’t          3
can’t    6,375   will          862   dare, have    ???

Figure 8.5  English modal verbs from the CELEX on-line dictionary.
Frequency counts are from the COBUILD 16 million word corpus.


English also has many words of more or less unique function, including
interjections (oh, ah, hey, man, alas), negatives (no, not), politeness
markers (please, thank you), greetings (hello, goodbye), and the existential
there (there are two on the table) among others. Whether these classes
are assigned particular names or lumped together (as interjections or even
adverbs) depends on the purpose of the labeling.


8.2  Tagsets  for  English 

The previous section gave broad descriptions of the kinds of lexical classes
that English words fall into. This section fleshes out that sketch by
describing the actual tagsets used in part-of-speech tagging, in preparation
for the various tagging algorithms to be described in the following sections.

There are a small number of popular tagsets for English, many of which
evolved from the 87-tag tagset used for the Brown corpus (Francis, 1979;
Francis and Kucera, 1982). Three of the most commonly used are the small
45-tag Penn Treebank tagset (Marcus et al., 1993), the medium-sized 61-tag
C5 tagset used by the Lancaster UCREL project’s CLAWS (the Constituent
Likelihood Automatic Word-tagging System) tagger to tag the British
National Corpus (BNC) (Garside et al., 1997), and the larger 146-tag C7
tagset (Leech et al., 1994); the C5 and C7 tagsets are listed in Appendix C.
(Also see Sampson (1987) and Garside et al. (1997) for a detailed summary of
the provenance and makeup of these and other tagsets.) This section will
present the smallest of them, the Penn Treebank set, and then discuss
specific additional tags from some of the other tagsets that might be useful
to incorporate for specific projects.

The  Penn  Treebank  tagset,  shown  in  Figure  8.6,  has  been  applied  to 
the  Brown  corpus  and  a number  of  other  corpora.  Here  is  an  example  of  a 
tagged  sentence  from  the  Penn  Treebank  version  of  the  Brown  corpus  (in  a 
flat  ASCII  file,  tags  are  often  represented  after  each  word,  following  a slash, 
but  tags  can  also  be  represented  in  various  other  ways): 

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN
of/IN other/JJ topics/NNS ./.
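
As an illustration of the slash convention, the following Python sketch
splits such a line into (word, tag) pairs. The function name `parse_tagged`
is hypothetical, and the sketch assumes whitespace-separated tokens with the
tag after the last slash; real corpus files may use other conventions.

```python
def parse_tagged(line):
    """Split a 'word/TAG word/TAG ...' line into (word, tag) pairs.

    rpartition splits at the LAST slash, so a token such as './.' or a
    word that itself contains a slash is still divided correctly.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
pairs = parse_tagged(sentence)
print(pairs[:3])   # [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN')]
print(pairs[-1])   # ('.', '.')
```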

The  Penn  Treebank  tagset  was  culled  from  the  original  87-tag  tagset 
for  the  Brown  corpus.  This  reduced  set  leaves  out  information  that  can  be 
recovered  from  the  identity  of  the  lexical  item.  For  example  the  original 
Brown  tagset  and  other  large  tagsets  like  C5  include  a separate  tag  for  each 
of  the  different  forms  of  the  verbs  do  (e.g.  C5  tag  ‘VDD’  for  did  and  ‘VDG’ 
for  doing),  be,  and  have.  These  were  omitted  from  the  Penn  set. 
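
The claim that the omitted distinctions are recoverable can be sketched in
Python as follows. Given a word and its Penn tag, a small table restores the
finer C5 tag. Only the two C5 tags named above (VDD and VDG) are included;
the names `C5_DO_TAGS` and `refine_tag` are hypothetical, and the Penn tags
assumed for did (VBD) and doing (VBG) follow the definitions in Figure 8.6.

```python
# (word, Penn tag) -> finer C5 tag. Deliberately incomplete: it covers
# only the two forms of 'do' mentioned in the text.
C5_DO_TAGS = {
    ("did", "VBD"): "VDD",
    ("doing", "VBG"): "VDG",
}

def refine_tag(word, penn_tag):
    """Return the C5 tag when the word's identity determines it,
    otherwise leave the Penn tag unchanged."""
    return C5_DO_TAGS.get((word.lower(), penn_tag), penn_tag)

print(refine_tag("did", "VBD"))   # VDD
print(refine_tag("ate", "VBD"))   # VBD
```

Because the refinement needs nothing beyond the word itself and its coarse
tag, no information is lost by leaving the separate tags out of the Penn set.
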

Tag   Description            Example          Tag   Description            Example
CC    Coordin. Conjunction   and, but, or     SYM   Symbol                 +, %, &
CD    Cardinal number        one, two, three  TO    “to”                   to
DT    Determiner             a, the           UH    Interjection           ah, oops
EX    Existential ‘there’    there            VB    Verb, base form        eat
FW    Foreign word           mea culpa        VBD   Verb, past tense       ate
IN    Preposition/sub-conj   of, in, by       VBG   Verb, gerund           eating
JJ    Adjective              yellow           VBN   Verb, past participle  eaten
JJR   Adj., comparative      bigger           VBP   Verb, non-3sg pres     eat
JJS   Adj., superlative      wildest          VBZ   Verb, 3sg pres         eats
LS    List item marker       1, 2, One        WDT   Wh-determiner          which, that
MD    Modal                  can, should      WP    Wh-pronoun             what, who
NN    Noun, sing. or mass    llama            WP$   Possessive wh-         whose
NNS   Noun, plural           llamas           WRB   Wh-adverb              how, where
NNP   Proper noun, singular  IBM              $     Dollar sign            $
NNPS  Proper noun, plural    Carolinas        #     Pound sign             #
PDT   Predeterminer          all, both        “     Left quote             (‘ or “)
POS   Possessive ending      ’s               ”     Right quote            (’ or ”)
PP    Personal pronoun       I, you, he       (     Left parenthesis       ([, (, {, <)
PP$   Possessive pronoun     your, one’s      )     Right parenthesis      (], ), }, >)
RB    Adverb                 quickly, never   ,     Comma                  ,
RBR   Adverb, comparative    faster           .     Sentence-final punc    (. ! ?)
RBS   Adverb, superlative    fastest          :     Mid-sentence punc      (: ; ... --)
RP    Particle               up, off

Figure 8.6  Penn Treebank Part-of-Speech Tags (Including Punctuation)

Certain syntactic distinctions were not marked in the Penn Treebank
tagset because Treebank sentences were parsed, not merely tagged, and so
some syntactic information is represented in the phrase structure. For
example, prepositions and subordinating conjunctions were combined into the
single tag IN, since the tree-structure of the sentence disambiguated them
(subordinating conjunctions always precede clauses, prepositions precede
noun phrases or prepositional phrases).

Most tagging situations, however, do not involve parsed corpora; for
this reason the Penn Treebank set is not specific enough for many uses. The
C7 tagset, for example, also distinguishes prepositions (II) from
subordinating conjunctions (CS), and distinguishes the preposition to (II)
from the infinitive marker to (TO).

Which tagset to use for a particular application depends, of course, on
how much information the application needs. The reader should see
Appendix C for a listing of the C5 and C7 tagsets.

8.3  Part of Speech Tagging

Part-of-speech tagging (or just tagging for short) is the process of assigning
a part-of-speech or other lexical class marker to each word in a corpus. Tags
are also usually applied to punctuation markers; thus tagging for natural
language is the same process as tokenization for computer languages, although
tags for natural languages are much more ambiguous. As we suggested at
the beginning of the chapter, taggers play an increasingly important role in
speech recognition, natural language parsing and information retrieval.

The input to a tagging algorithm is a string of words and a specified
tagset of the kind described in the previous section. The output is a single
best tag for each word. For example, here are some sample sentences from
the ATIS corpus of dialogues about air-travel reservations that we will
discuss in Chapter 9. For each we have shown a potential tagged output using
the Penn Treebank tagset defined in Figure 8.6 on page 295:

VB    DT    NN      .
Book  that  flight  .

VBZ   DT    NN      VB     NN      ?
Does  that  flight  serve  dinner  ?


Even in these simple examples, automatically assigning a tag to each
ambiguous word is not trivial. For example, book is ambiguous. That is, it
has more than one possible usage and part of speech. It can be a verb (as in
book that flight or to book the suspect) or a noun (as in hand me that book,
or a book of matches). Similarly that can be a determiner (as in Does that
flight serve dinner), or a complementizer (as in I thought that your flight
was earlier). The problem of POS-tagging is to resolve these ambiguities,
choosing the proper tag for the context. Part-of-speech tagging is thus one
of the many disambiguation tasks we will see in this book.

How hard is the tagging problem? Most words in English are unambiguous;
i.e. they have only a single tag. But many of the most common words of
English are ambiguous (for example can can be an auxiliary (‘to be able’),
a noun (‘a metal container’), or a verb (‘to put something in such a metal
container’)). In fact DeRose (1988) reports that while only 11.5% of English
word types in the Brown Corpus are ambiguous, over 40% of Brown tokens are
ambiguous. Based on Francis and Kucera (1982), he gives the table of tag
ambiguity in Figure 8.7.

Unambiguous (1 tag)      35,340
Ambiguous (2-7 tags)      4,100
    2 tags                3,760
    3 tags                  264
    4 tags                   61
    5 tags                   12
    6 tags                    2
    7 tags                    1

Figure 8.7 The number of word types in Brown corpus by degree of ambiguity
(after DeRose (1988)).
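
The difference between type ambiguity and token ambiguity can be made concrete with a small sketch. The corpus below is a hypothetical toy example, not Brown data, and the function name ambiguity_rates is our own; the point is only that a few ambiguous but frequent words can make the token rate much higher than the type rate.

```python
from collections import defaultdict

# Toy hand-tagged corpus (hypothetical data, for illustration only).
tagged = [
    ("the", "DT"), ("can", "NN"), ("can", "MD"), ("hold", "VB"),
    ("the", "DT"), ("book", "NN"), ("book", "VB"), ("the", "DT"),
    ("flight", "NN"), ("that", "DT"), ("can", "MD"), ("leave", "VB"),
]

def ambiguity_rates(tagged_tokens):
    """Return (fraction of ambiguous word types, fraction of ambiguous tokens)."""
    tags_seen = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_seen[word].add(tag)
    # A type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_seen.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_seen)
    token_rate = sum(1 for w, _ in tagged_tokens if w in ambiguous) / len(tagged_tokens)
    return type_rate, token_rate

type_rate, token_rate = ambiguity_rates(tagged)
print(f"ambiguous types: {type_rate:.1%}, ambiguous tokens: {token_rate:.1%}")
```

In this toy corpus only two of the seven word types (can and book) are ambiguous, but they account for five of the twelve tokens.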


Luckily, it turns out that many of the 40% ambiguous tokens are easy
to disambiguate. This is because the various tags associated with a word
are not equally likely. For example, a can be a determiner, or the letter a
(perhaps as part of an acronym or an initial). But the determiner sense of a
is much more likely.

Most tagging algorithms fall into one of two classes: rule-based taggers
and stochastic taggers. Rule-based taggers generally involve a large
database of hand-written disambiguation rules which specify, for example,
that an ambiguous word is a noun rather than a verb if it follows a
determiner. The next section will describe a sample rule-based tagger,
ENGTWOL, based on the Constraint Grammar architecture of Karlsson et al.
(1995).

Stochastic taggers generally resolve tagging ambiguities by using a
training corpus to compute the probability of a given word having a given
tag in a given context. Section 8.5 describes a stochastic tagger called the
HMM tagger, also called a Maximum Likelihood Tagger, or a Markov model
tagger, based on the Hidden Markov Model presented in Chapter 7.

Finally, Section 8.6 will describe an approach to tagging called the
transformation-based tagger or the Brill tagger, after Brill (1995). The
Brill tagger shares features of both tagging architectures. Like the
rule-based tagger, it is based on rules which determine when an ambiguous
word should have a given tag. Like the stochastic taggers, it has a
machine-learning component: the rules are automatically induced from a
previously-tagged training corpus.

8.4 Rule-based Part-of-speech Tagging


The earliest algorithms for automatically assigning part-of-speech were
based on a two-stage architecture (Harris, 1962; Klein and Simmons, 1963;
Greene and Rubin, 1971). The first stage used a dictionary to assign each
word a list of potential parts of speech. The second stage used large lists
of hand-written disambiguation rules to winnow down this list to a single
part-of-speech for each word.

The ENGTWOL tagger (Voutilainen, 1995) is based on the same two-stage
architecture, although both the lexicon and the disambiguation rules
are much more sophisticated than the early algorithms. The ENGTWOL
lexicon is based on the two-level morphology described in Chapter 3, and
has about 56,000 entries for English word stems (Heikkila, 1995), counting
a word with multiple parts of speech (e.g. nominal and verbal senses of hit)
as separate entries, and of course not counting inflected and many derived
forms. Each entry is annotated with a set of morphological and syntactic
features. Figure 8.8 shows some selected words, together with a slightly
simplified listing of their features.


Word        POS    Additional POS features
smaller     ADJ    COMPARATIVE
entire      ADJ    ABSOLUTE ATTRIBUTIVE
fast        ADV    SUPERLATIVE
that        DET    CENTRAL DEMONSTRATIVE SG
all         DET    PREDETERMINER SG/PL QUANTIFIER
dog’s       N      GENITIVE SG
furniture   N      NOMINATIVE SG NOINDEFDETERMINER
one-third   NUM    SG
she         PRON   PERSONAL FEMININE NOMINATIVE SG3
show        V      IMPERATIVE VFIN
show        V      PRESENT -SG3 VFIN
show        N      NOMINATIVE SG
shown       PCP2   SVOO SVO SV
occurred    PCP2   SV
occurred    V      PAST VFIN SV

Figure 8.8 Sample lexical entries from the ENGTWOL lexicon described
in Voutilainen (1995) and Heikkila (1995).

Most of the features in Figure 8.8 are relatively self-explanatory: SG
for singular, -SG3 for other than third-person-singular. ABSOLUTE means
non-comparative and non-superlative for an adjective, NOMINATIVE just
means non-genitive, and PCP2 means past participle. PRE, CENTRAL,
and POST are ordering slots for determiners (predeterminers (all) come
before determiners (the): all the president’s men). NOINDEFDETERMINER
means that words like furniture do not appear with the indefinite determiner
a. SV, SVO, and SVOO specify the subcategorization or complementation
pattern for the verb. Subcategorization will be discussed in Chapter 9
and Chapter 11, but briefly SV means the verb appears solely with a subject
(nothing occurred); SVO with a subject and an object (I showed the film);
SVOO with a subject and two complements: She showed her the ball.

In the first stage of the tagger, each word is run through the two-level
lexicon transducer and the entries for all possible parts of speech are
returned. For example the phrase Pavlov had shown that salivation... would
return the following list (one line per possible tag, with the correct tag
shown in boldface):


Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG
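
The first stage can be pictured as a plain lookup that returns every reading and defers all disambiguation to the second stage. The sketch below is only an illustration: the real ENGTWOL lexicon is a two-level transducer, whereas here a hypothetical Python dictionary named LEXICON stands in for it, populated with the readings listed above.

```python
# Hypothetical mini-lexicon standing in for the two-level lexicon transducer;
# the readings are copied from the list above.
LEXICON = {
    "pavlov": ["PAVLOV N NOM SG PROPER"],
    "had": ["HAVE V PAST VFIN SVO", "HAVE PCP2 SVO"],
    "shown": ["SHOW PCP2 SVOO SVO SV"],
    "that": ["ADV", "PRON DEM SG", "DET CENTRAL DEM SG", "CS"],
    "salivation": ["N NOM SG"],
}

def lookup(sentence):
    """Stage one: return every candidate reading per word, with no disambiguation."""
    return [(word, LEXICON.get(word.lower(), [])) for word in sentence.split()]

readings = lookup("Pavlov had shown that salivation")
for word, candidates in readings:
    print(word, candidates)
```

Words missing from the dictionary simply get an empty list here; how a real system treats unknown words is a separate question.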


A set of about 1,100 constraints is then applied to the input sentence
to rule out incorrect parts of speech; the boldfaced entries in the table
above show the desired result, in which the preterite (not participle) tag
is applied to had, and the complementizer (CS) tag is applied to that. The
constraints are used in a negative way, to eliminate tags that are
inconsistent with the context. For example one constraint eliminates all
readings of that except the ADV (adverbial intensifier) sense (this is the
sense in the sentence it isn’t that odd). Here’s a simplified version of the
constraint:


Adverbial-that rule

Given input: "that"
if
    (+1 A/ADV/QUANT);  /* if next word is adj, adverb, or quantifier */
    (+2 SENT-LIM);     /* and following which is a sentence boundary, */
    (NOT -1 SVOC/A);   /* and the previous word is not a verb like */
                       /* 'consider' which allows adjs as object complements */
then eliminate non-ADV tags
else eliminate ADV tag

The  first  two  clauses  of  this  rule  check  to  see  that  the  that  directly 
precedes  a sentence-final  adjective,  adverb,  or  quantifier.  In  all  other  cases 
the  adverb  reading  is  eliminated.  The  last  clause  eliminates  cases  preceded 
by  verbs  like  consider  or  believe  which  can  take  a noun  and  an  adjective;  this 
is  to  avoid  tagging  the  following  instance  of  that  as  an  adverb: 

I consider  that  odd. 
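
A rough sketch of how such a constraint might be applied is given below. It is not ENGTWOL code: the data representation (a list of word and candidate-tag-set pairs with no final punctuation token), the function name adverbial_that, and the small set of SVOC/A verbs matched by surface form are all assumptions made for illustration.

```python
def adverbial_that(sentence, i, svoc_a_verbs=frozenset({"consider", "believe"})):
    """Apply the simplified adverbial-that constraint to the token at index i.

    `sentence` is a list of (word, set_of_candidate_tags) pairs without a
    final punctuation token. Returns the reduced tag set for 'that'.
    """
    word, tags = sentence[i]
    if word.lower() != "that":
        raise ValueError("constraint only applies to the word 'that'")
    next_tags = sentence[i + 1][1] if i + 1 < len(sentence) else set()
    next_ok = bool(next_tags & {"A", "ADV", "QUANT"})   # (+1 A/ADV/QUANT)
    boundary = i + 2 == len(sentence)                    # (+2 SENT-LIM)
    # (NOT -1 SVOC/A): matched on the surface form only, a simplification.
    prev_ok = i == 0 or sentence[i - 1][0].lower() not in svoc_a_verbs
    if next_ok and boundary and prev_ok:
        return tags & {"ADV"}    # eliminate non-ADV tags
    return tags - {"ADV"}        # eliminate ADV tag

THAT = {"ADV", "PRON", "DET", "CS"}
odd1 = [("it", {"PRON"}), ("isn't", {"V"}), ("that", THAT), ("odd", {"A"})]
odd2 = [("I", {"PRON"}), ("consider", {"V"}), ("that", THAT), ("odd", {"A"})]
kept1 = adverbial_that(odd1, 2)   # only the ADV reading survives
kept2 = adverbial_that(odd2, 2)   # the ADV reading is removed
```

Note that the constraint never picks a tag directly; it only removes readings, which is the negative use of constraints described above.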

Another rule is used to express the constraint that the complementizer
sense of that is most likely to be used if the previous word is a verb which
expects a complement (like believe, think, or show), and if the that is
followed by the beginning of a noun phrase, and a finite verb.

This description oversimplifies the ENGTWOL architecture; the system
also includes probabilistic constraints, and also makes use of other
syntactic information we haven’t discussed. The interested reader should
consult Karlsson et al. (1995).


8.5  Stochastic  Part-of-speech  Tagging 


The use of probabilities in tags is quite old; probabilities in tagging were
first used by Stolz et al. (1965), a complete probabilistic tagger with
Viterbi decoding was sketched by Bahl and Mercer (1976), and various
stochastic taggers were built in the 1980s (Marshall, 1983; Garside, 1987;
Church, 1988; DeRose, 1988). This section describes a particular stochastic
tagging algorithm generally known as the Hidden Markov Model or HMM tagger.
The intuition behind all stochastic taggers is a simple generalization of
the ‘pick the most-likely tag for this word’ approach that we discussed
above, based on the Bayesian framework we saw in Chapter 5.

For a given sentence or word sequence, HMM taggers choose the tag
sequence that maximizes the following formula:

$$P(\text{word} \mid \text{tag}) \times P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (8.1)$$

The rest of this section will explain and motivate this particular
equation. HMM taggers generally choose a tag sequence for a whole sentence
rather than for a single word, but for pedagogical purposes, let’s first see
how an HMM tagger assigns a tag to an individual word. We first give the
basic equation, then work through an example, and, finally, give the
motivation for the equation.

A bigram-HMM tagger of this kind chooses the tag $t_i$ for word $w_i$ that
is most probable given the previous tag $t_{i-1}$ and the current word $w_i$:

$$t_i = \operatorname*{argmax}_{j} P(t_j \mid t_{i-1}, w_i) \qquad (8.2)$$

Through some simplifying Markov assumptions that we will give below, we
restate Equation 8.2 to give the basic HMM equation for a single tag as
follows:

$$t_i = \operatorname*{argmax}_{j} P(t_j \mid t_{i-1})\, P(w_i \mid t_j) \qquad (8.3)$$

A Motivating  Example 

Let’s  work  through  an  example,  using  an  HMM  tagger  to  assign  the  proper 
tag  to  the  single  word  race  in  the  following  examples  (both  shortened  slightly 
from  the  Brown  corpus): 

(8.4) Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
tomorrow/NN

(8.5)  People/NNS  continue/VBP  to/TO  inquire/VB  the/DT  reason/NN 
for/IN  the/DT  race/NN  for/IN  outer/JJ  space/NN 

In the first example race is a verb (VB), in the second a noun (NN).
For the purposes of this example, let’s pretend that some other mechanism
has already done the best tagging job possible on the surrounding
words, leaving only the word race untagged. A bigram version of the HMM
tagger makes the simplifying assumption that the tagging problem can be
solved by looking at nearby words and tags. Consider the problem of
assigning a tag to race given just these subsequences:

to/TO race/???
the/DT race/???

Let’s see how Equation 8.3 applies to our example with race: it says that
if we are trying to choose between NN and VB for the sequence to race, we
choose the tag that has the greater of these two probabilities:

P(VB|TO)P(race|VB)  (8.6) 

and 

P(NN|TO)P(race|NN)  (8.7) 

Equation 8.3 and its instantiations Equations 8.6 and 8.7 each have
two probabilities: a tag sequence probability $P(t_j \mid t_{i-1})$ and a
word-likelihood $P(w_i \mid t_j)$. For race, the tag sequence probabilities
P(NN|TO) and P(VB|TO) give us the answer to the question “how likely are we
to expect a verb (noun) given the previous tag?”. They can just be computed
from a corpus by counting and normalizing. We would expect that a verb is
more likely to follow TO than a noun is, since infinitives (to race, to run,
to eat) are common in English. While it is possible for a noun to follow TO
(walk to school, related to hunting), it is less common.

Sure enough, a look at the combined Brown and Switchboard corpora
gives us the following probabilities, showing that verbs are about fifteen
times as likely as nouns after TO:

P(NN|TO)  = .021 

P(VB|TO)  = .34 

The second part of Equation 8.3 and its instantiations Equations 8.6
and 8.7 is the lexical likelihood: the likelihood of the word race given
each tag, P(race|VB) and P(race|NN). Note that this likelihood term is not
asking ‘which is the most likely tag for this word’. That is, the likelihood
term is not P(VB|race). Instead we are computing P(race|VB). The
probability, slightly counterintuitively, answers the question “if we were
expecting a verb, how likely is it that this verb would be race”.

Here are the lexical likelihoods from the combined Brown and Switchboard
corpora:


P(race|NN)  = .00041 
P(race|VB)  = .00003 

If we multiply the lexical likelihoods with the tag sequence
probabilities, we see that even the simple bigram version of the HMM tagger
correctly tags race as a VB despite the fact that it is the less likely
sense of race:

P(VB|TO)P(race|VB) = .00001
P(NN|TO)P(race|NN) = .000007
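
The comparison can be reproduced in a few lines. The sketch below simply multiplies the rounded probabilities quoted in the text; the dictionary and function names are our own. With these rounded inputs the NN product comes to about .0000086 rather than the printed .000007, presumably because the printed value was computed from unrounded counts, but the ranking of the two tags is the same.

```python
# Rounded probabilities quoted in the text (combined Brown and Switchboard).
p_tag_given_prev = {("VB", "TO"): 0.34, ("NN", "TO"): 0.021}
p_word_given_tag = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

def score(tag, prev_tag, word):
    """Bigram HMM score for one word: P(tag | prev_tag) * P(word | tag)."""
    return p_tag_given_prev[(tag, prev_tag)] * p_word_given_tag[(word, tag)]

scores = {tag: score(tag, "TO", "race") for tag in ("VB", "NN")}
best = max(scores, key=scores.get)
print(scores, best)
```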


The  Actual  Algorithm  for  HMM  tagging 

Recall that a real HMM tagger isn’t just choosing the best tag for a single
word but the best sequence of tags for an entire sentence. Now that we’ve
seen the single-word intuition for the HMM tagger, let’s give the complete
equations. Generally, we make the Viterbi approximation and choose the
most probable tag sequence for each sentence. This approach thus assumes
that we are trying to compute for each sentence the most probable sequence
of tags $T = t_1, t_2, \ldots, t_n$ given the sequence of words in the
sentence ($W$):

$$\hat{T} = \operatorname*{argmax}_{T \in \tau} P(T \mid W)$$

By Bayes Law, $P(T \mid W)$ can be expressed as:

$$P(T \mid W) = \frac{P(T)\,P(W \mid T)}{P(W)}$$

Thus we are attempting to choose the sequence of tags that maximizes
$\frac{P(T)\,P(W \mid T)}{P(W)}$:

$$\hat{T} = \operatorname*{argmax}_{T \in \tau} \frac{P(T)\,P(W \mid T)}{P(W)}$$

Since we are looking for the most likely tag sequence for a sentence
given a particular word sequence, the probability of the word sequence
$P(W)$ will be the same for each tag sequence and we can ignore it.

$$\hat{T} = \operatorname*{argmax}_{T \in \tau} P(T)\,P(W \mid T)$$

From the chain rule of probability:

$$P(T)\,P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid w_1 t_1 \ldots w_{i-1} t_{i-1} t_i)\, P(t_i \mid w_1 t_1 \ldots w_{i-1} t_{i-1})$$

We make the N-gram assumption again, just as we did in Equation 6.9
for modeling the probability of word sequences. The trigram model is most
often used, so let’s define that one. First, we make the simplifying
assumption that the probability of a word is dependent only on its tag:

$$P(w_i \mid w_1 t_1 \ldots w_{i-1} t_{i-1} t_i) = P(w_i \mid t_i)$$

Next, we make the assumption that the tag history can be approximated
by the most recent two tags:

$$P(t_i \mid w_1 t_1 \ldots w_{i-1} t_{i-1}) = P(t_i \mid t_{i-2} t_{i-1})$$

Thus we are choosing the tag sequence that maximizes:

$$P(t_1)\,P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2} t_{i-1}) \left[ \prod_{i=1}^{n} P(w_i \mid t_i) \right]$$

As usual, we can use maximum likelihood estimation from relative
frequencies to estimate these probabilities.

$$P(t_i \mid t_{i-2} t_{i-1}) = \frac{c(t_{i-2} t_{i-1} t_i)}{c(t_{i-2} t_{i-1})}$$

$$P(w_i \mid t_i) = \frac{c(w_i, t_i)}{c(t_i)}$$

This  model  can  also  be  smoothed  (for  example  by  the  backoff  or  deleted 
interpolation  algorithms  of  Chapter  6)  to  avoid  zero  probabilities. 
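
The relative-frequency estimates can be sketched directly from counts. The code below is a minimal, unsmoothed illustration on a hypothetical two-sentence corpus; the names p_transition and p_emission are our own, and tag-pair histories are counted only where a third tag follows so that each conditional distribution sums to one.

```python
from collections import Counter

# Toy hand-tagged sentences (hypothetical data, for illustration only).
corpus = [
    [("book", "VB"), ("that", "DT"), ("flight", "NN")],
    [("does", "VBZ"), ("that", "DT"), ("flight", "NN"),
     ("serve", "VB"), ("dinner", "NN")],
]

tag_count, word_tag_count = Counter(), Counter()
history_count, trigram_count = Counter(), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    for word, tag in sent:
        tag_count[tag] += 1
        word_tag_count[(word, tag)] += 1
    for i in range(2, len(tags)):
        history_count[(tags[i - 2], tags[i - 1])] += 1
        trigram_count[(tags[i - 2], tags[i - 1], tags[i])] += 1

def p_transition(t2, t1, t):
    """MLE of P(t | t2 t1) = c(t2 t1 t) / c(t2 t1); 0.0 for an unseen history."""
    seen = history_count[(t2, t1)]
    return trigram_count[(t2, t1, t)] / seen if seen else 0.0

def p_emission(word, tag):
    """MLE of P(word | tag) = c(word, tag) / c(tag); 0.0 for an unseen tag."""
    return word_tag_count[(word, tag)] / tag_count[tag] if tag_count[tag] else 0.0

print(p_transition("VB", "DT", "NN"), p_emission("flight", "NN"))
print(p_emission("race", "VB"))  # an unseen pair gets probability zero
```

The zero returned for the unseen pair (race, VB) is exactly the situation that smoothing is meant to handle.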

Finding  the  most  probable  tag  sequence  can  be  done  with  the  Viterbi 
algorithm  described  in  Chapter  7. 
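
For readers who want to see the decoding step in code, here is a compact sketch of Viterbi search. To keep it short it uses a bigram rather than a trigram tag model, works in log space, and takes the probability tables as plain dictionaries; the function signature and the toy tables (including the assumption that to is always emitted by TO) are ours, with the race figures taken from the example earlier in this section.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Most probable tag sequence under a bigram HMM, computed in log space.

    start_p[t], trans_p[prev][t] and emit_p[t][word] are probabilities;
    missing entries are treated as zero.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best path ending in tag t at word i
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log(trans_p[p].get(t, 0)))
            best[i][t] = (best[i - 1][prev] + log(trans_p[prev].get(t, 0))
                          + log(emit_p[t].get(words[i], 0)))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

tags = ["TO", "VB", "NN"]
start_p = {"TO": 1.0}
trans_p = {"TO": {"VB": 0.34, "NN": 0.021}, "VB": {}, "NN": {}}
emit_p = {"TO": {"to": 1.0}, "VB": {"race": 0.00003}, "NN": {"race": 0.00041}}
tagged_path = viterbi(["to", "race"], tags, start_p, trans_p, emit_p)
print(tagged_path)
```

Extending the sketch to trigrams amounts to making each state a pair of tags, at the cost of a larger search space.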

Weischedel  et  al.  (1993)  and  DeRose  (1988)  have  reported  accuracies 
of  above  96%  for  this  algorithm. 

The HMM tagger we have seen so far is trained on hand-tagged data.
Kupiec (1992), Cutting et al. (1992a), and others show that it is also
possible to train an HMM tagger on unlabeled data, using the EM algorithm of
Chapter 7 and Appendix D. These taggers still start with a dictionary which
lists which tags can be assigned to which words; the EM algorithm then
learns the word likelihood function for each tag, and the tag transition
probabilities. An experiment by Merialdo (1994), however, indicates that
with even a small amount of training data, a tagger trained on hand-tagged
data worked better than one trained via EM. Thus the EM-trained ‘pure HMM’
tagger is probably best suited in cases where no training data is available,
for example when tagging languages for which there is no previously
hand-tagged data.


8.6  Transformation-Based  Tagging 

Transformation-Based Tagging, sometimes called Brill tagging, is an
instance of the Transformation-Based Learning (TBL) approach to machine
learning (Brill, 1995), and draws inspiration from both the rule-based and
stochastic taggers. Like the rule-based taggers, TBL is based on rules that
specify what tags should be assigned to what words. But like the stochastic
taggers, TBL is a machine learning technique, in which rules are
automatically induced from the data. Like some but not all of the HMM
taggers, TBL is a supervised learning technique; it assumes a pre-tagged
training corpus.


Methodology Box: Evaluating Taggers

Taggers are often evaluated by comparing them with a human-labeled
Gold Standard test set, based on percent correct: the percentage of all
tags in the test set where the tagger and the Gold Standard agree. Most
current tagging algorithms have an accuracy (percent-correct) of around
96% to 97% for simple tagsets like the Penn Treebank set; human annotators
can then be used to manually post-process the tagged corpus.

How good is 96%? Since tag sets and tasks differ, the performance of
taggers can be compared against a lower-bound baseline and an upper-bound
ceiling. One way to set a ceiling is to see how well humans do on the task.
Marcus et al. (1993), for example, found that human annotators agreed on
about 96-97% of the tags in the Penn Treebank version of the Brown Corpus.
This suggests that the Gold Standard may have a 3-4% margin of error, and
that it is not possible to get 100% accuracy. Two experiments by
Voutilainen (1995, p. 174), however, found that if humans were allowed to
discuss the tags, they reached consensus on 100% of the tags.

Key Concept #6. Human Ceiling: When using a human Gold Standard to
evaluate a classification algorithm, check the agreement rate of humans
on the standard.

The standard baseline, suggested by Gale et al. (1992) (in the slightly
different context of word-sense disambiguation), is to choose the unigram
most-likely tag for each ambiguous word. The most-likely tag for each word
can be computed from a hand-tagged corpus (which may be the same as the
training corpus for the tagger being evaluated).

Key Concept #7. Unigram Baseline: When designing a new classification
algorithm, always compare it against the unigram baseline (assigning each
token to the class it occurred in most often in the training set).

Charniak et al. (1993) showed that a (slightly smoothed) version of this
baseline algorithm achieves an accuracy of 90-91%! Tagging algorithms
since Harris (1962) have incorporated this intuition about tag-frequency.
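
The unigram baseline and the percent-correct measure from the Methodology Box are simple enough to sketch in a few lines. Everything below is illustrative: the toy data is hypothetical, the function names are ours, and giving unseen words a default tag of NN is an assumption rather than part of the baseline as described.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_tokens):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def accuracy(model, gold_tokens, default_tag="NN"):
    """Percent correct against a gold standard; unseen words get default_tag."""
    correct = sum(1 for word, tag in gold_tokens
                  if model.get(word, default_tag) == tag)
    return correct / len(gold_tokens)

# Hypothetical toy data, for illustration only.
train = [("the", "DT"), ("race", "NN"), ("race", "NN"), ("race", "VB"), ("to", "TO")]
test = [("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]
model = train_unigram(train)
acc = accuracy(model, test)
print(model, acc)
```

The same most-likely-tag assignment also serves as the starting point for the transformation-based tagger described in this section.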

Samuel et al. (1998a) offer a useful analogy for understanding the TBL
paradigm, which they credit to Terry Harvey. Imagine an artist painting a
picture of a white house with green trim against a blue sky. Suppose most of
the picture was sky, and hence most of the picture was blue. The artist might
begin by using a very broad brush and painting the entire canvas blue. Next
she might switch to a somewhat smaller white brush, and paint the entire
house white. She would just color in the whole house, not worrying about
the brown roof, or the blue windows or the green gables. Next she takes a
smaller brown brush and colors over the roof. Now she takes up the blue
paint on a small brush and paints in the blue windows on the house. Finally
she takes a very fine green brush and does the trim on the gables.

The painter starts with a broad brush that covers a lot of the canvas
but colors a lot of areas that will have to be repainted. The next layer
colors less of the canvas, but also makes fewer ‘mistakes’. Each new layer
uses a finer brush that corrects less of the picture, but makes fewer
mistakes. TBL uses somewhat the same method as this painter. The TBL
algorithm has a set of tagging rules. A corpus is first tagged using the
broadest rule, i.e. the one that applies to the most cases. Then a slightly
more specific rule is chosen, which changes some of the original tags. Next
comes an even narrower rule, which changes a smaller number of tags (some of
which might be previously-changed tags).

How  TBL  rules  are  applied 

Let’s look at one of the rules used by Brill’s (1995) tagger. Before the
rules apply, the tagger labels every word with its most-likely tag. We get
these most-likely tags from a tagged corpus. For example, in the Brown
corpus, race is most likely to be a noun:

P(NN|race)  = .98 

P(VB|race)  = .02 

This means that the two examples of race that we saw above will both
be coded as NN. In the first case, this is a mistake, as NN is the incorrect
tag:

(8.8) is/VBZ expected/VBN to/TO race/NN tomorrow/NN

In the second case this race is correctly tagged as an NN:

(8.9)  the/DT  race/NN  for/IN  outer/JJ  space/NN 

After selecting the most-likely tag, Brill’s tagger applies its
transformation rules. As it happens, Brill’s tagger learned a rule that
applies exactly to this mistagging of race:

Change  NN  to  VB  when  the  previous  tag  is  TO 

This  rule  would  change  race/NN  to  race/VB  in  exactly  the  following 
situation,  since  it  is  preceded  by  to/TO: 

(8.10) expected/VBN to/TO race/NN → expected/VBN to/TO race/VB
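
Applying a single learned transformation of this form is straightforward. The sketch below is an illustration rather than Brill's implementation: the function name apply_rule is ours, and it reads the triggering context from the input tagging rather than from tags already changed in the same pass, which is one of several possible conventions.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the previous tag is prev_tag.

    `tagged` is a list of (word, tag) pairs; contexts are read from the
    input tagging, and a new list is returned.
    """
    out = list(tagged)
    for i in range(1, len(tagged)):
        word, tag = tagged[i]
        if tag == from_tag and tagged[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

s1 = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
s2 = [("the", "DT"), ("race", "NN"), ("for", "IN"), ("outer", "JJ"), ("space", "NN")]
fixed1 = apply_rule(s1, "NN", "VB", "TO")   # race/NN becomes race/VB
fixed2 = apply_rule(s2, "NN", "VB", "TO")   # no TO context, so nothing changes
```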

How  TBL  Rules  are  Learned 

Brill’s TBL algorithm has three major stages. It first labels every word
with its most-likely tag. It then examines every possible transformation,
and selects the one that results in the most improved tagging. Finally, it
re-tags the data according to this rule. The last two stages are repeated
until some stopping criterion is reached, such as insufficient improvement
over the previous pass. Note that stage two requires that TBL knows the
correct tag of each word; i.e., TBL is a supervised learning algorithm.

The output of the TBL process is an ordered list of transformations;
these then constitute a ‘tagging procedure’ that can be applied to a new
corpus. In principle the set of possible transformations is infinite, since
we could imagine transformations such as “transform NN to VB if the previous
word was ‘IBM’ and the word ‘the’ occurs between 17 and 158 words before
that”. But TBL needs to consider every possible transformation, in order to
pick the best one on each pass through the algorithm. Thus the algorithm
needs a way to limit the set of transformations. This is done by designing
a small set of templates, abstracted transformations. Every allowable
transformation is an instantiation of one of the templates. Brill’s set of
templates is listed in Figure 8.9. Figure 8.10 gives the details of this
algorithm for learning transformations.

    The preceding (following) word is tagged z.
    The word two before (after) is tagged z.
    One of the two preceding (following) words is tagged z.
    One of the three preceding (following) words is tagged z.
    The preceding word is tagged z and the following word is tagged w.
    The preceding (following) word is tagged z and the word
        two before (after) is tagged w.

Figure 8.9 Brill’s (1995) templates. Each begins with ‘Change tag a to tag
b when: ...’. The variables a, b, z, and w range over parts of speech.

At the heart of Figure 8.10 are the two functions Get_Best_Transformation
and Get_Best_Instance. Get_Best_Transformation is called with a list of
potential templates; for each template, it calls Get_Best_Instance.
Get_Best_Instance iteratively tests every possible instantiation of each
template by filling in specific values for the tag variables a, b, z and w.

In practice, there are a number of ways to make the algorithm more
efficient. For example, templates and instantiated transformations can be
suggested in a data-driven manner; a transformation-instance might only be
suggested if it would improve the tagging of some specific word. The search
can also be made more efficient by pre-indexing the words in the training
corpus by potential transformation. Roche and Schabes (1997a) show how
the tagger can also be sped up by converting each rule into a finite-state
transducer and composing all the transducers.

Figure  8.11  shows  a few  of  the  rules  learned  by  Brill’s  original  tagger. 


8.7  Other  Issues 

Multiple  tags  and  multiple  words 

Two issues that arise in tagging are tag indeterminacy and multi-part words.
Tag indeterminacy arises when a word is ambiguous between multiple tags
and it is impossible or very difficult to disambiguate. In this case, some
taggers allow the use of multiple tags. This is the case in the Penn Treebank
and in the British National Corpus. Common tag indeterminacies include
adjective versus preterite versus past participle (JJ/VBD/VBN), and
adjective versus noun as prenominal modifier (JJ/NN).

The second issue concerns multi-part words. The C5 and C7 tagsets,
for example, allow prepositions like ‘in terms of’ to be treated as a single
word by adding numbers to each tag:

function TBL(corpus) returns transforms-queue
    INITIALIZE-WITH-MOST-LIKELY-TAGS(corpus)
    until end condition is met do
        templates ← GENERATE-POTENTIAL-RELEVANT-TEMPLATES
        best-transform ← GET-BEST-TRANSFORM(corpus, templates)
        APPLY-TRANSFORM(best-transform, corpus)
        ENQUEUE(best-transform-rule, transforms-queue)
    end
    return(transforms-queue)

function GET-BEST-TRANSFORM(corpus, templates) returns transform
    for each template in templates
        (instance, score) ← GET-BEST-INSTANCE(corpus, template)
        if (score > best-transform.score) then best-transform ← (instance, score)
    return(best-transform)

function GET-BEST-INSTANCE(corpus, template) returns transform
    for from-tag ← from tag-1 to tag-n do
        for to-tag ← from tag-1 to tag-n do
            for pos ← from 1 to corpus-size do
                if (correct-tag(pos) == to-tag && current-tag(pos) == from-tag)
                    num-good-transforms(current-tag(pos-1))++
                elseif (correct-tag(pos) == from-tag && current-tag(pos) == from-tag)
                    num-bad-transforms(current-tag(pos-1))++
            end
            best-Z ← ARGMAX_t(num-good-transforms(t) - num-bad-transforms(t))
            if (num-good-transforms(best-Z) - num-bad-transforms(best-Z)
                    > best-instance.Z) then
                best-instance ← “Change tag from from-tag to to-tag
                                 if previous tag is best-Z”
    return(best-instance)

procedure APPLY-TRANSFORM(transform, corpus)
    for pos ← from 1 to corpus-size do
        if ((current-tag(pos) == best-rule-from)
                && (current-tag(pos-1) == best-rule-prev))
            current-tag(pos) = best-rule-to

Figure 8.10 The TBL algorithm for learning to tag. Get_Best_Instance
would have to change for transformation templates other than ‘Change tag
from X to Y if previous tag is Z’. After Brill (1995).
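
A runnable sketch of the learning loop, restricted to the single template ‘Change tag from X to Y if previous tag is Z’, might look as follows. It is a simplified illustration, not Brill's code: the function names, the min_gain stopping criterion, and the flat-list corpus representation are our assumptions, and contexts are read from a snapshot of the current tagging so that the scored gain of a rule matches its actual effect.

```python
from collections import Counter

def best_instance(current, gold, tagset):
    """Best 'change a to b if previous tag is z' rule and its net gain."""
    best_rule, best_score = None, 0
    for a in tagset:
        for b in tagset:
            if a == b:
                continue
            good, bad = Counter(), Counter()
            for i in range(1, len(current)):
                if current[i] != a:
                    continue
                z = current[i - 1]
                if gold[i] == b:
                    good[z] += 1    # the change would fix an error
                elif gold[i] == a:
                    bad[z] += 1     # the change would break a correct tag
            for z in good:
                if good[z] - bad[z] > best_score:
                    best_rule, best_score = (a, b, z), good[z] - bad[z]
    return best_rule, best_score

def learn_tbl(initial, gold, tagset, min_gain=1):
    """Greedily learn an ordered list of (from, to, previous) transformations."""
    current, rules = list(initial), []
    while True:
        rule, gain = best_instance(current, gold, tagset)
        if rule is None or gain < min_gain:
            return rules
        a, b, z = rule
        snapshot = list(current)    # read contexts from the pre-change tagging
        for i in range(1, len(snapshot)):
            if snapshot[i] == a and snapshot[i - 1] == z:
                current[i] = b
        rules.append(rule)

# Hypothetical toy corpus: initial most-likely tags versus the correct tags.
initial = ["VBN", "TO", "NN", "NN", "DT", "NN", "TO", "NN"]
gold = ["VBN", "TO", "VB", "NN", "DT", "NN", "TO", "VB"]
learned = learn_tbl(initial, gold, ["NN", "VB", "TO", "DT", "VBN"])
print(learned)
```

On this toy corpus the learner recovers the rule discussed earlier, changing NN to VB after TO, and then stops because no further transformation yields a net improvement.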

     Change tags
#    From  To    Condition                          Example
1    NN    VB    Previous tag is TO                 to/TO race/NN → VB
2    VBP   VB    One of the previous 3 tags is MD   might/MD vanish/VBP → VB
3    NN    VB    One of the previous 2 tags is MD   might/MD not reply/NN → VB
4    VB    NN    One of the previous 2 tags is DT
5    VBD   VBN   One of the previous 3 tags is VBZ

Figure 8.11 The first 20 nonlexicalized transformations from the Brill
tagger (Brill, 1995).

in/II31 terms/II32 of/II33

Finally, some tagged corpora split certain words; for example the Penn
Treebank and the British National Corpus split contractions and the
’s-genitive from their stems:

would/MD  n’t/RB 
children/NNS  ’s/POS 


Unknown  words 

All the tagging algorithms we have discussed require a dictionary that lists
the possible parts of speech of every word. But the largest dictionary will
still not contain every possible word, as we saw in Chapter 4. Proper names
and acronyms are created very often, and even new common nouns and verbs
enter the language at a surprising rate. Therefore in order to build a
complete tagger we need some method for guessing the tag of an unknown word.

The simplest possible unknown-word algorithm is to pretend that each
unknown word is ambiguous among all possible tags, with equal probability.
Then the tagger must rely solely on the contextual POS trigrams to suggest
the proper tag. A slightly more complex algorithm is based on the idea that
the probability distribution of tags over unknown words is very similar to the
distribution of tags over words that occurred only once in a training set, an
idea that was suggested by both Baayen and Sproat (1996) and Dermatas and
Kokkinakis (1995). These words that only occur once are known as hapax
legomena (singular hapax legomenon). For example, unknown words and
hapax legomena are similar in that they are both most likely to be nouns,
followed by verbs, but are very unlikely to be determiners or interjections.
Thus the likelihood P(w_i|t_i) for an unknown word is determined by the
average of the distribution over all singleton words in the training set. (Recall
Methodology Box: Error Analysis

In order to improve a computational model we need to analyze
and understand where it went wrong. Analyzing the error in a pattern
classifier like a part-of-speech tagger is usually done via a confusion
matrix, also called a contingency table. A confusion matrix for an
N-way classification task is an N-by-N matrix where the cell (x, y)
contains the number of times an item with correct classification
x was classified by the model as y. For example, the following table
shows a portion of the confusion matrix from the HMM tagging
experiments of Franz (1996). The row labels indicate correct tags,
column labels indicate the tagger’s hypothesized tags, and each cell
indicates percentage of the overall tagging error. Thus 4.4% of the
total errors were caused by mistagging a VBD as a VBN. Common
errors are boldfaced in the table.


The confusion matrix above, and related error analyses in Franz
(1996), Kupiec (1992), and Ratnaparkhi (1996), suggest that some
major problems facing current taggers are:

1. NN versus NNP versus JJ: These are hard to distinguish
prenominally. Distinguishing proper nouns is especially
important for information extraction and machine translation.

2. RP versus RB versus IN: All of these can appear in sequences
of satellites immediately following the verb.

3. VBD versus VBN versus JJ: Distinguishing these is important
for partial parsing (participles are used to find passives),
and for correctly labeling the edges of noun-phrases.
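
A confusion matrix of the kind described in this box can be tallied in a few lines of Python. The sketch below is illustrative only: the function names and the toy tag sequences are our own, and the error percentages follow the box's convention of expressing each off-diagonal cell as a share of all tagging errors.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count how often each correct tag (row) received each hypothesized tag (column)."""
    return Counter(zip(gold, predicted))

def error_percentages(counts):
    """Express each off-diagonal cell as a percentage of all errors."""
    errors = {cell: n for cell, n in counts.items() if cell[0] != cell[1]}
    total = sum(errors.values())
    if total == 0:
        return {}
    return {cell: 100.0 * n / total for cell, n in errors.items()}

# Invented example: four tokens, two of them mistagged.
gold = ["VBD", "VBD", "NN", "JJ"]
predicted = ["VBN", "VBD", "NN", "NN"]
counts = confusion_counts(gold, predicted)
shares = error_percentages(counts)
```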
that this idea of using ‘things we’ve seen once’ as an estimator for ‘things
we’ve never seen’ proved useful as the key concept Things Seen Once in the
Witten-Bell and Good-Turing algorithms of Chapter 6).
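
The hapax legomena idea can be sketched as follows. This is a minimal illustration under our own assumptions: the training set is a list of (word, tag) pairs, and the function name hapax_tag_distribution is hypothetical. It returns the distribution of tags over words seen exactly once, which the text proposes as a stand-in for the distribution of tags over unknown words.

```python
from collections import Counter

def hapax_tag_distribution(tagged_corpus):
    """Relative frequency of each tag among words that occur exactly once."""
    word_counts = Counter(word for word, _ in tagged_corpus)
    hapax_tags = Counter(tag for word, tag in tagged_corpus
                         if word_counts[word] == 1)
    total = sum(hapax_tags.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in hapax_tags.items()}

# Invented toy corpus: "the" occurs twice, the other words once each.
corpus = [("the", "DT"), ("dog", "NN"), ("the", "DT"),
          ("barks", "VBZ"), ("cat", "NN")]
dist = hapax_tag_distribution(corpus)
```

On this toy corpus the determiner tag receives no probability mass, mirroring the observation that unknown words are very unlikely to be determiners.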

The most powerful unknown-word algorithms make use of information
about how the word is spelled. For example, words that end in the letter -s are
likely to be plural nouns (NNS), while words ending with -ed tend to be past
participles (VBN). Words starting with capital letters are likely to be nouns.
Weischedel et al. (1993) used four specific kinds of orthographic features: 3
inflectional endings (-ed, -s, -ing), 32 derivational endings (such as -ion, -al,
-ive, and -ly), 4 values of capitalization (capitalized initial, capitalized
non-initial, etc.), and hyphenation. They used the following equation to compute
the likelihood of an unknown word:

$$P(w_i \mid t_i) = p(\text{unknown-word} \mid t_i) \cdot p(\text{capital} \mid t_i) \cdot p(\text{endings/hyph} \mid t_i)$$
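
The product above can be sketched in Python. The sketch simplifies the feature set considerably and is not the Weischedel et al. system: capitalization is reduced to a yes/no feature, only the three inflectional endings are checked, hyphenation is ignored, and every probability in the tables is invented for illustration.

```python
def unknown_word_likelihood(word, tag, p_unknown, p_capital, p_ending):
    """Multiply the three per-tag factors for an unknown word (simplified features)."""
    capitalized = word[:1].isupper()
    ending = next((suffix for suffix in ("ing", "ed", "s")
                   if word.lower().endswith(suffix)), "none")
    return p_unknown[tag] * p_capital[tag][capitalized] * p_ending[tag][ending]

# Hypothetical probability tables, made up for this example.
p_unknown = {"NNS": 0.10, "VBN": 0.05}
p_capital = {"NNS": {True: 0.2, False: 0.8},
             "VBN": {True: 0.1, False: 0.9}}
p_ending = {"NNS": {"s": 0.90, "ed": 0.01, "ing": 0.01, "none": 0.08},
            "VBN": {"s": 0.02, "ed": 0.80, "ing": 0.01, "none": 0.17}}

score_nns = unknown_word_likelihood("flarbs", "NNS", p_unknown, p_capital, p_ending)
score_vbn = unknown_word_likelihood("flarbs", "VBN", p_unknown, p_capital, p_ending)
```

With these made-up tables the nonsense word "flarbs" scores far higher as a plural noun than as a past participle, because of its -s ending.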

Other researchers, rather than relying on these hand-designed features,
have used machine learning to induce useful features. Brill (1995) used the
TBL algorithm, where the allowable templates were defined orthographically
(the first N letters of the word, the last N letters of the word, etc.).
His algorithm induced all the English inflectional features, hyphenation, and
many derivational features such as -ly and -al. Franz (1996) uses a loglinear
model which includes more features, such as the length of the word and
various prefixes, and furthermore includes interaction terms among various
features.

Class-based  N-grams 

Now that we have a way of automatically assigning a class to each word in
a corpus, we can use this information to augment our N-gram models. The
class-based N-gram is a variant of the N-gram that uses the frequency of
sequences of POS (or other) classes to help produce a more knowledgeable
estimate of the probability of word strings. The basic class-based N-gram
defines the conditional probability of a word w_n based on its history as the
product of two factors: the probability of the class given the preceding
classes (based on an N-gram of classes), and the probability of a particular
word given the class:


$$P(w_n \mid w_{n-N+1}^{n-1}) = P(w_n \mid c_n)\, P(c_n \mid c_{n-N+1}^{n-1})$$

The  maximum  likelihood  estimate  (MLE)  of  the  probability  of  the 
word  given  the  class  and  the  probability  of  the  class  given  the  previous  class 


Methodology Box: Computing Agreement via κ


One problem with the percent correct metric for evaluating
taggers is that it doesn’t control for how easy the tagging task is.
If 99% of the tags are, say, NN, then getting 99% correct isn’t very
good; we could have gotten 99% correct just by guessing NN. This
means that it’s really impossible to compare taggers which are being
run on different test sets or different tasks. As the previous
methodology box noted, one factor that can help normalize different values
of percent correct is to measure the difficulty of a given task via the
unigram baseline for that task.

In fact, there is an evaluation statistic called kappa (κ) that takes
this baseline into account, inherently controlling for the complexity
of the task (Siegel and Castellan, 1988; Carletta, 1996). Kappa
can be used instead of percent correct when comparing a tagger to
a Gold Standard, or especially when comparing human labelers to
each other, when there is no one correct answer. Kappa is the ratio of
the proportion of times that 2 classifiers agree (corrected for chance
agreement) to the maximum proportion of times that the classifiers
could agree (corrected for chance agreement):

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

P(A) is the proportion of times that the hypothesis agrees with the
standard; i.e., percent correct. P(E) is the proportion of times that the
hypothesis and the standard would be expected to agree by chance.
P(E) can be computed from some other knowledge, or it can be
computed from the actual confusion matrix for the labels being
compared. The bounds for κ are just like those for percent correct; when
there is no agreement (other than what would be expected by chance)
κ = 0. When there is complete agreement, κ = 1.

The κ statistic is most often used when there is no ‘Gold
Standard’ at all. This occurs, for example, when comparing human
labelers to each other on a difficult subjective task. In this case, κ is a very
useful evaluation metric, the ‘average pairwise agreement corrected
for chance agreement’. Krippendorff (1980) suggests that a value of
κ > .8 can be considered good reliability.
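
As a rough sketch, κ can be computed from a confusion matrix as follows. The estimate of P(E) used here is an assumption on our part: it squares each row's share of the total count and sums the results, which is one simple way of estimating chance agreement from the matrix itself. The function name and the example matrix are invented for illustration.

```python
def kappa(matrix):
    """Kappa from a square confusion matrix (rows: standard, columns: hypothesis)."""
    total = sum(sum(row) for row in matrix)
    # P(A): observed agreement, the share of counts on the diagonal.
    p_a = sum(matrix[i][i] for i in range(len(matrix))) / total
    # P(E): chance agreement, estimated here from the row totals (an assumption).
    p_e = sum((sum(row) / total) ** 2 for row in matrix)
    return (p_a - p_e) / (1 - p_e)

# Invented example: two labels, 90 agreements out of 100 items.
example = [[45, 5],
           [5, 45]]
value = kappa(example)
```

Here percent correct is 0.9 but chance agreement is 0.5, so κ comes out at about 0.8. The function divides by zero when every item carries the same label, since P(E) is then 1.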




Chapter  8.  Word  Classes  and  Part-of-Speech  Tagging 


can  be  computed  as  follows: 


$$P(w \mid c) = \frac{C(w)}{C(c)}$$

$$P(c_i \mid c_{i-1}) = \frac{C(c_{i-1} c_i)}{\sum_{c} C(c_{i-1} c)}$$
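
These two maximum likelihood estimates can be combined into a class-based bigram probability, sketched below. The sketch makes some assumptions of its own: classes are taken from a tagged corpus given as (word, class) pairs, a word is counted separately for each class it appears with, and no smoothing is applied, so an unseen class or history raises a division error. The function name train_class_bigram is hypothetical.

```python
from collections import Counter

def train_class_bigram(tagged_corpus):
    """Return a function computing P(w | c) * P(c | previous class) by MLE."""
    word_in_class = Counter(tagged_corpus)                  # C(w), counted within class c
    class_counts = Counter(c for _, c in tagged_corpus)     # C(c)
    classes = [c for _, c in tagged_corpus]
    class_bigrams = Counter(zip(classes, classes[1:]))      # C(c_{i-1} c_i)
    history_totals = Counter(classes[:-1])                  # sum over c of C(c_{i-1} c)

    def prob(word, cls, prev_cls):
        p_word = word_in_class[(word, cls)] / class_counts[cls]
        p_class = class_bigrams[(prev_cls, cls)] / history_totals[prev_cls]
        return p_word * p_class

    return prob

# Invented toy corpus.
corpus = [("to", "TO"), ("fly", "VB"), ("to", "TO"),
          ("race", "VB"), ("the", "DT"), ("flight", "NN")]
prob = train_class_bigram(corpus)
p_fly = prob("fly", "VB", "TO")
```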


A class-based N-gram can rely on standard tagsets like the Penn tagset
to define the classes, or on application-specific sets (for example using tags
like CITY and AIRLINE for an airline information system). The classes
can also be automatically induced by clustering words in a corpus (Brown
et al., 1992). A number of researchers have shown that class-based N-grams
can be useful in decreasing the perplexity and word-error rate of language
models, especially if they are mixed in some way with regular word-based
N-grams (Jelinek, 1990; Kneser and Ney, 1993; Heeman, 1999; Samuelsson
and Reichl, 1999).


8.8  Summary 

This chapter introduced the idea of parts-of-speech and part-of-speech
tagging. The main ideas:

• Languages generally have a relatively small set of closed class words,
which are often highly frequent, generally act as function words, and
can be very ambiguous in their part-of-speech tags. Open class words
generally include various kinds of nouns, verbs, adjectives. There
are a number of part-of-speech coding schemes, based on tagsets of
between 40 and 200 tags.

• Part-of-speech tagging is the process of assigning a part-of-speech
label to each of a sequence of words. Taggers can be characterized as
rule-based or stochastic. Rule-based taggers use hand-written rules to
resolve tag ambiguity. Stochastic taggers are either HMM-based,
choosing the tag sequence which maximizes the product of word
likelihood and tag sequence probability, or cue-based, using decision trees
or maximum entropy models to combine probabilistic features.

• Taggers are often evaluated by comparing their output from a test set
to human labels for that test set. Error analysis can help pinpoint areas
where a tagger doesn’t perform well.
Bibliographical and Historical Notes

The earliest implemented part-of-speech assignment algorithm may have
been part of the parser in Zellig Harris’s Transformations and Discourse
Analysis Project (TDAP), which was implemented between June 1958 and
July 1959 at the University of Pennsylvania (Harris, 1962). Previous
natural language processing systems had used dictionaries with part-of-speech
information for words, but have not been described as performing
part-of-speech disambiguation. As part of its parsing, TDAP did part-of-speech
disambiguation via 14 hand-written rules, whose use of part-of-speech tag
sequences prefigures all the modern algorithms, and which were run in an
order based on the relative frequency of tags for a word. The parser/tagger
was reimplemented recently and is described by Joshi and Hopely (1999)
and Karttunen (1999), who note that the parser was essentially implemented
(ironically in a very modern way) as a cascade of finite-state transducers.

Soon after the TDAP parser came the Computational Grammar Coder
(CGC) of Klein and Simmons (1963). The CGC had three components: a
lexicon, a morphological analyzer, and a context disambiguator. The small
1500-word lexicon included exceptional words that could not be accounted
for in the simple morphological analyzer, including function words as well as
irregular nouns, verbs, and adjectives. The morphological analyzer used
inflectional and derivational suffixes to assign part-of-speech classes. A word
was run through the lexicon and morphological analyzer to produce a
candidate set of parts-of-speech. A set of 500 context rules was then used to
disambiguate this candidate set, by relying on surrounding islands of
unambiguous words. For example, one rule said that between an ARTICLE and a
VERB, the only allowable sequences were ADJ-NOUN, NOUN-ADVERB,
or NOUN-NOUN. The CGC algorithm reported 90% accuracy on applying
a 30-tag tagset to articles from the Scientific American and a children’s
encyclopedia.

The TAGGIT tagger (Greene and Rubin, 1971) was based on the Klein
and Simmons (1963) system, using the same architecture but increasing the
size of the dictionary and the size of the tagset (to 87 tags). For example, the
following sample rule states that a word x is unlikely to be a plural
noun (NNS) before a third person singular verb (VBZ):

x VBZ → not NNS

TAGGIT was applied to the Brown Corpus and, according to Francis
and Kucera (1982, p. 9), “resulted in the accurate tagging of 77% of the
corpus” (the remainder of the Brown Corpus was tagged by hand).

In the 1970’s the Lancaster-Oslo/Bergen (LOB) Corpus was compiled
as a British English equivalent of the Brown Corpus. It was tagged with
the CLAWS tagger (Marshall, 1983, 1987; Garside, 1987), a probabilistic
algorithm which can be viewed as an approximation to the HMM tagging
approach. The algorithm used tag bigram probabilities, but instead
of storing the word-likelihood of each tag, tags were marked either as rare
(P(tag|word) < .01), infrequent (P(tag|word) < .10), or normally frequent
(P(tag|word) > .10).

The probabilistic PARTS tagger of Church (1988) was very close to a
full HMM tagger. It extended the CLAWS idea to assign full lexical
probabilities to each word/tag combination, and used Viterbi decoding to find a
tag sequence. Like the CLAWS tagger, however, it stored the probability of
the tag given the word:

$$P(\text{tag} \mid \text{word}) \cdot P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (8.11)$$

rather  than  using  the  probability  of  the  word  given  the  tag,  as  an  HMM  tagger 
does: 


$$P(\text{word} \mid \text{tag}) \cdot P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (8.12)$$

Later taggers explicitly introduced the use of the Hidden Markov Model,
often with the EM training algorithm (Kupiec, 1992; Merialdo, 1994;
Weischedel et al., 1993), including the use of variable length Markov models
(Schütze and Singer, 1994).

A number of recent stochastic algorithms use various statistical and
machine-learning tools to estimate the probability of a tag or tag-sequence
given a large number of relevant features such as the neighboring words and
neighboring parts of speech, as well as assorted orthographic and
morphological features. These features are then combined to estimate the probability
of a tag either via a decision tree (Jelinek et al., 1994; Magerman, 1995), the
Maximum Entropy algorithm (Ratnaparkhi, 1996), log-linear models (Franz,
1996), or networks of linear separators (SNOW) (Roth and Zelenko, 1998).
Brill (1997) presents an unsupervised version of the TBL algorithm.
Exercises


8.1 Find one tagging error in each of the following sentences that are
tagged with the Penn Treebank tagset:

a. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN

b. Does/VBZ this/DT flight/NN serve/VB dinner/NNS

c. I/PRP have/VB a/DT friend/NN living/VBG in/IN Denver/NNP

d. What/WDT flights/NNS do/VBP you/PRP have/VB from/IN
Milwaukee/NNP to/IN Tampa/NNP

e. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS

8.2 Use the Penn Treebank tagset to tag each word in the following
sentences from Damon Runyon’s short stories. You may ignore punctuation.
Some of these are quite difficult; do your best.

a.  It  is  a nice  night. 

b.  This  crap  game  is  over  a garage  in  Fifty-second  Street. . . 

c.  . . . Nobody  ever  takes  the  newspapers  she  sells  . . . 

d.  He  is  a tall,  skinny  guy  with  a long,  sad,  mean-looking  kisser,  and  a 
mournful  voice. 

e.  ...  I am  sitting  in  Mindy’s  restaurant  putting  on  the  gefillte  fish,  which 
is  a dish  I am  very  fond  of,  . . . 

f. When a guy and a doll get to taking peeks back and forth at each other,
why there you are indeed.


8.3 Now compare your tags from Exercise 8.2 with one or two friends’
answers. On which words did you disagree the most? Why?


8.4 Implement the Kappa algorithm of page 313, and compute the
agreement between you and your friends. To compute P(A) and P(E), you may
use the following equations, modified from Walker et al. (1997). These
assume that you have the confusion matrix M, where the correct answers label
the rows and the hypotheses label the columns (as seen in the Methodology
Box on page 311):

$$P(E) = \sum_{i=1}^{n} \left(\frac{t_i}{T}\right)^2$$

$$P(A) = \frac{\sum_{i=1}^{n} M(i,i)}{T}$$

where t_i is the sum of the counts in row i of M, and T is the sum of all
the counts in M.

8.5  Now  tag  the  sentences  in  Exercise  8.2  using  the  more  detailed  C7  tagset 
in  Appendix  C. 

8.6 Implement the TBL algorithm in Figure 8.10. Create a small number
of templates and train the tagger on any POS-tagged training set you can
find.


8.7 Recall that the Church (1988) tagger is not an HMM tagger since it
incorporates the probability of the tag given the word:

$$P(\text{tag} \mid \text{word}) \cdot P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (8.13)$$

rather than using the likelihood of the word given the tag, as an HMM
tagger does:

$$P(\text{word} \mid \text{tag}) \cdot P(\text{tag} \mid \text{previous } n \text{ tags}) \qquad (8.14)$$

As a gedanken-experiment, construct a sentence, a set of tag transition
probabilities, and a set of lexical tag probabilities that demonstrate a
way in which the HMM tagger can produce a better answer than the Church
tagger.

8.8  Build  an  HMM  tagger.  This  requires  (1)  that  you  have  implemented  the 
Viterbi  algorithm  from  Chapter  5 or  Chapter  7,  (2)  that  you  have  a dictionary 
with  part-of-speech  information  and  (3)  that  you  have  either  (a)  a part-of- 
speech-tagged  corpus  or  (b)  an  implementation  of  the  Forward  Backward 
algorithm.  If  you  have  a labeled  corpus,  train  the  transition  and  observation 
probabilities  of  an  HMM  tagger  directly  on  the  hand-tagged  data.  If  you 
have  an  unlabeled  corpus,  train  using  Forward  Backward. 

8.9  Now  run  your  algorithm  on  a small  test  set  that  you  have  hand-labeled. 
Find  five  errors  and  analyze  them. 


CONTEXT-FREE 
GRAMMARS  FOR 
ENGLISH 


[Figure: parse tree with root Sentence branching into NP and VP; the
surviving leaves read “took the book”.]

The first context-free grammar parse tree (Chomsky, 1956)

If  on  a winter’s  night  a traveler  by  Italo  Calvino 
Nuclear  and  Radiochemistry  by  Gerhart  Friedlander  et  al. 

The  Fire  Next  Time  by  James  Baldwin 

A Tad  Overweight,  but  Violet  Eyes  to  Die  For  by  G.  B.  Trudeau 
Sometimes  a Great  Notion  by  Ken  Kesey 
Dancer  from  the  Dance  by  Andrew  Holleran 

6 books  in  English  whose  titles  are  not  constituents, 
from  Pullum  (1991,  p.  195) 


In her essay The Anatomy of a Recipe, M. F. K. Fisher (1968) wryly
comments that it is “modish” to refer to the anatomy of a thing or problem. The
similar use of grammar to describe the structures of an area of knowledge
had a vogue in the 19th century (e.g. Busby’s (1818) A Grammar of Music
and Field’s (1888) A Grammar of Colouring). In recent years the word
grammar has made a reappearance, although usually now it is the grammar
rather than a grammar that is being described (e.g. The Grammar of Graphics,
The Grammar of Conducting). Perhaps scholars are simply less modest
than they used to be? Or perhaps the word grammar itself has changed a
bit, from ‘a listing of principles or structures’, to ‘those principles or
structures as a field of inquiry’. Following this second reading, in this chapter
we turn to what might be called The Grammar of Grammar, or perhaps The
Grammar of Syntax.

The word syntax comes from the Greek syntaxis, meaning ‘setting
out together or arrangement’, and refers to the way words are arranged
together. We have seen various syntactic notions in previous chapters.
Chapter 8 talked about part-of-speech categories as a kind of equivalence class for
words. Chapter 6 talked about the importance of modeling word order. This
chapter and the following ones introduce a number of more complex
notions of syntax and grammar. There are three main new ideas: constituency,
grammatical relations, and subcategorization and dependencies.

The fundamental idea of constituency is that groups of words may
behave as a single unit or phrase, called a constituent. For example we will
see that a group of words called a noun phrase often acts as a unit; noun
phrases include single words like she or Michael and phrases like the house,
Russian Hill, and a well-weathered three-story structure. This chapter will
introduce the use of context-free grammars, a formalism that will allow us
to model these constituency facts.

Grammatical relations are a formalization of ideas from traditional
grammar about SUBJECTS and OBJECTS. In the sentence:

(9.1) She ate a mammoth breakfast.

the noun phrase She is the SUBJECT and a mammoth breakfast is the OBJECT.
Grammatical relations will be introduced in this chapter when we talk about
syntactic agreement, and will be expanded upon in Chapter 11.

Subcategorization and dependency relations refer to certain kinds
of relations between words and phrases. For example the verb want can be
followed by an infinitive, as in I want to fly to Detroit, or a noun phrase, as in
I want a flight to Detroit. But the verb find cannot be followed by an infinitive
(*I found to fly to Dallas). These are called facts about the subcategory of the
verb, which will be discussed starting on page 337, and again in Chapter 11.

All of these kinds of syntactic knowledge can be modeled by various
kinds of grammars that are based on context-free grammars. Context-free
grammars are thus the backbone of many models of the syntax of
natural language (and, for that matter, of computer languages). As such they
are integral to most models of natural language understanding, of grammar
checking, and more recently of speech understanding. They are powerful
enough to express sophisticated relations among the words in a sentence, yet
computationally tractable enough that efficient algorithms exist for parsing
sentences with them (as we will see in Chapter 10). Later in Chapter 12 we
will introduce probabilistic versions of context-free grammars, which model
many aspects of human sentence processing and which provide sophisticated
language models for speech recognition.

In addition to an introduction to the grammar formalism, this chapter
also provides an overview of the grammar of English. We will be modeling
example sentences from the Air Travel Information System (ATIS) domain
(Hemphill et al., 1990). ATIS systems are spoken language systems that
can help book airline reservations. Users try to book flights by conversing
with the system, specifying constraints like I’d like to fly from Atlanta to
Denver. The government funded a number of different research sites across
the country to build ATIS systems in the early 90’s, and so a lot of data was
collected and a significant amount of research has been done on the resulting
data. The sentences we will be modeling in this chapter are the user queries
to the system.


9.1  Constituency 

How do words group together in English? How do we know they are
really grouping together? Let’s consider the standard grouping that is usually
called the noun phrase or sometimes the noun group. This is a sequence
of words surrounding at least one noun. Here are some examples of noun
phrases (thanks to Damon Runyon):

three  parties  from  Brooklyn 
a high-class  spot  such  as  Mindy’s 
the  Broadway  coppers 
they 

Harry  the  Horse 

the  reason  he  comes  into  the  Hot  Box 

How do we know that these words group together (or ‘form a
constituent’)? One piece of evidence is that they can all appear in similar
syntactic environments, for example before a verb.

three  parties  from  Brooklyn  arrive. . . 
a high-class  spot  such  as  Mindy’s  attracts. . . 
the  Broadway  coppers  love. . . 
they  sit 


But while the whole noun phrase can occur before a verb, this is not
true of each of the individual words that make up a noun phrase. The
following are not grammatical sentences of English (recall that we use an asterisk
(*) to mark fragments that are not grammatical English sentences):

*from  arrive. . . 

*as  attracts. . . 

*the  is. . . 

*spot  is. . . 

Thus  in  order  to  correctly  describe  facts  about  the  ordering  of  these 
words  in  English,  we  must  be  able  to  say  things  like  “Noun  Phrases  can 
occur  before  verbs”. 

Other kinds of evidence for constituency come from what are called
preposed or postposed constructions. For example, the prepositional phrase
on September seventeenth can be placed in a number of different locations in
the following examples, including preposed at the beginning, and postposed
at the end:

On  September  seventeenth,  I’d  like  to  fly  from  Atlanta  to  Denver 

I’d  like  to  fly  on  September  seventeenth  from  Atlanta  to  Denver 

I’d  like  to  fly  from  Atlanta  to  Denver  on  September  seventeenth 


But again, while the entire phrase can be placed differently, the
individual words making up the phrase cannot be:

*On  September,  I'd  like  to  fly  seventeenth  from  Atlanta  to  Denver 
*On  I'd  like  to  fly  September  seventeenth  from  Atlanta  to  Denver 
*I’d  like  to  fly  on  September  from  Atlanta  to  Denver  seventeenth 


Section  9.11  will  give  other  motivations  for  context-free  grammars 
based  on  their  ability  to  model  recursive  structures. 

There  are  many  other  kinds  of  evidence  that  groups  of  words  often 
behave  as  a single  constituent  (see  Radford  (1988)  for  a good  survey). 


9.2  Context-Free  Rules  and  Trees 

The most commonly used mathematical system for modeling constituent
structure in English and other natural languages is the Context-Free
Grammar, or CFG. Context-free grammars are also called Phrase-Structure
Grammars, and the formalism is equivalent to what is also called
Backus-Naur Form or BNF. The idea of basing a grammar on constituent structure
dates back to the psychologist Wilhelm Wundt (1900), but was not
formalized until Chomsky (1956), and, independently, Backus (1959).

A context-free grammar consists of a set of rules or productions, each
of which expresses the ways that symbols of the language can be grouped
and ordered together, and a lexicon of words and symbols. For example,
the following productions express that an NP (or noun phrase) can be
composed of either a ProperNoun or of a determiner (Det) followed by a
Nominal, and that a Nominal can be one or more Nouns.


NP → Det Nominal                (9.2)
NP → ProperNoun                 (9.3)
Nominal → Noun | Noun Nominal   (9.4)

Context-free rules can be hierarchically embedded, so we could
combine the previous rules with others like these which express facts about the
lexicon:


Det → a                         (9.5)
Det → the                       (9.6)
Noun → flight                   (9.7)

The symbols that are used in a CFG are divided into two classes. The
symbols that correspond to words in the language (‘the’, ‘nightclub’) are
called terminal symbols; the lexicon is the set of rules that introduce these
terminal symbols. The symbols that express clusters or generalizations of
these are called nonterminals. In each context-free rule, the item to the right
of the arrow (→) is an ordered list of one or more terminals and
nonterminals, while to the left of the arrow is a single nonterminal symbol expressing
some cluster or generalization. Notice that in the lexicon, the nonterminal
associated with each word is its lexical category, or part-of-speech, which
we defined in Chapter 8.
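
One way to hold such a grammar in a program is as a mapping from each nonterminal to its list of possible right-hand sides. The sketch below is only one possible representation; the names GRAMMAR, nonterminals, and terminals are our own, and the lexical entry for ProperNoun (Denver) is a hypothetical addition, since the sample rules give no word for that category.

```python
# Each key is a nonterminal; each value lists its alternative expansions.
GRAMMAR = {
    "NP": [["Det", "Nominal"], ["ProperNoun"]],      # rules 9.2 and 9.3
    "Nominal": [["Noun"], ["Noun", "Nominal"]],      # rule 9.4
    "Det": [["a"], ["the"]],                         # rules 9.5 and 9.6
    "Noun": [["flight"]],                            # rule 9.7
    "ProperNoun": [["Denver"]],                      # hypothetical lexical entry
}

def nonterminals(grammar):
    """Symbols that appear on the left-hand side of some rule."""
    return set(grammar)

def terminals(grammar):
    """Symbols that appear only on right-hand sides, i.e. the words."""
    used = {symbol
            for expansions in grammar.values()
            for rhs in expansions
            for symbol in rhs}
    return used - set(grammar)

words = terminals(GRAMMAR)
categories = nonterminals(GRAMMAR)
```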

A CFG is usually thought of in two ways: as a device for generating
sentences, or as a device for assigning a structure to a given sentence. As a
generator, we could read the → arrow as ‘rewrite the symbol on the left with
the string of symbols on the right’. So starting from the symbol

NP,

we can use rule 9.2 to rewrite NP as

Det Nominal,

and then rule 9.4:

Det Noun,

and finally via rules 9.5 and 9.7 as

a flight.

We  say  the  string  a flight  can  be  derived  from  the  nonterminal  NP. 
Thus  a CFG  can  be  used  to  randomly  generate  a series  of  strings.  This 
sequence  of  rule  expansions  is  called  a derivation  of  the  string  of  words. 
It is common to represent a derivation by a parse tree (commonly shown
inverted with the root at the top). Here is the tree representation of this
derivation:

        NP
       /  \
     Det  Nominal
      |      |
      a    Noun
             |
          flight


The formal language defined by a CFG is the set of strings that are
derivable from the designated start symbol. Each grammar must have one
designated start symbol, which is often called S. Since context-free
grammars are often used to define sentences, S is usually interpreted as the
‘sentence’ node, and the set of strings that are derivable from S is the set of
sentences in some simplified version of English.

Let’s add to our sample grammar a couple of higher-level rules that
expand S, and a couple of others. One will express the fact that a sentence can
consist of a noun phrase and a verb phrase:

S → NP VP        I prefer a morning flight

A verb phrase in English consists of a verb followed by assorted other
things; for example, one kind of verb phrase consists of a verb followed by
a noun phrase:

VP → Verb NP        prefer a morning flight

Or the verb phrase may have a noun phrase and a prepositional phrase:

VP → Verb NP PP        leave Boston in the morning

Or the verb may be followed just by a prepositional phrase:

VP → Verb PP        leaving on Thursday

A prepositional  phrase  generally  has  a preposition  followed  by  a noun 
phrase.  For  example,  a very  common  type  of  prepositional  phrase  in  the 
ATIS  corpus  is  used  to  indicate  location  or  direction: 

PP → Preposition NP        from Los Angeles

The NP inside a PP need not be a location; PPs are often used with
times and dates, and with other nouns as well; they can be arbitrarily
complex. Here are ten examples from the ATIS corpus:

to Seattle                on these flights
in Minneapolis            about the ground transportation in Chicago
on Wednesday              of the round trip flight on United Airlines
in the evening            of the AP fifty seven flight
on the ninth of July      with a stopover in Nashville

Figure 9.2 gives a sample lexicon and Figure 9.3 summarizes the grammar
rules we’ve seen so far, which we’ll call L0. Note that we can use the
or-symbol | to indicate that a non-terminal has alternate possible expansions.

We can use this grammar to generate sentences of this ‘ATIS-language’.
We start with S, expand it to NP VP, then choose a random expansion of NP
(let’s say to I), and a random expansion of VP (let’s say to Verb NP), and so
on until we generate the string I prefer a morning flight. Figure 9.4 shows a
parse tree that represents a complete derivation of I prefer a morning flight.
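
The generation procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout, the function name `generate`, and the `max_depth` safeguard (which forces the shortest expansion once the derivation gets deep, so that recursion through Nominal and PP always stops) are hypothetical choices, not part of the grammar formalism. Only the categories that the rules of Figure 9.3 actually use are included, with a few words each.

```python
import random

# Rules of the sample grammar plus a few lexicon entries per category.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["Proper-Noun"], ["Det", "Nominal"]],
    "Nominal": [["Noun", "Nominal"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Preposition", "NP"]],
    "Noun": [["flights"], ["breeze"], ["trip"], ["morning"]],
    "Verb": [["is"], ["prefer"], ["like"], ["need"], ["want"], ["fly"]],
    "Pronoun": [["me"], ["I"], ["you"], ["it"]],
    "Proper-Noun": [["Alaska"], ["Baltimore"], ["Los Angeles"], ["Chicago"]],
    "Det": [["the"], ["a"], ["an"], ["this"], ["these"], ["that"]],
    "Preposition": [["from"], ["to"], ["on"], ["near"]],
}

def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` top-down, choosing each expansion at random."""
    if symbol not in GRAMMAR:            # a terminal symbol, i.e. a word
        return [symbol]
    options = GRAMMAR[symbol]
    if depth >= max_depth:               # hypothetical safeguard (see lead-in)
        rhs = min(options, key=len)      # force the shortest expansion
    else:
        rhs = rng.choice(options)
    words = []
    for child in rhs:
        words.extend(generate(child, rng, depth + 1, max_depth))
    return words

rng = random.Random(0)                   # fixed seed for repeatable output
print(" ".join(generate("S", rng)))
```

Because the expansions are chosen blindly, many generated strings are odd English (the grammar has no agreement or subcategorization constraints yet), which is exactly the issue later sections of the chapter take up.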

It is sometimes convenient to represent a parse tree in a more compact
format called bracketed notation, essentially the same as LISP tree
representation; here is the bracketed representation of the parse tree of Figure 9.4:

[S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
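
As a small illustration of how close bracketed notation is to a nested data structure, the sketch below stores the parse tree of I prefer a morning flight as nested Python tuples and prints it in bracketed form. The tuple encoding (label first, children after) and the function name `bracketed` are assumptions made for this example.

```python
def bracketed(tree):
    """Render a (label, child, ...) tuple tree in bracketed notation."""
    if isinstance(tree, str):            # a leaf is just a word
        return tree
    label, *children = tree
    return "[" + label + " " + " ".join(bracketed(c) for c in children) + "]"

tree = ("S",
        ("NP", ("Pro", "I")),
        ("VP", ("V", "prefer"),
               ("NP", ("Det", "a"),
                      ("Nom", ("N", "morning"), ("N", "flight")))))

print(bracketed(tree))
# [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [N flight]]]]]
```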

Noun        → flights | breeze | trip | morning | ...
Verb        → is | prefer | like | need | want | fly
Adjective   → cheapest | non-stop | first | latest | other | direct | ...
Pronoun     → me | I | you | it | ...
Proper-Noun → Alaska | Baltimore | Los Angeles | Chicago | United | American | ...
Determiner  → the | a | an | this | these | that | ...
Preposition → from | to | on | near | ...
Conjunction → and | or | but | ...

Figure 9.2  The lexicon for L0.

S       → NP VP              I + want a morning flight
NP      → Pronoun            I
        | Proper-Noun        Los Angeles
        | Det Nominal        a + flight
Nominal → Noun Nominal       morning + flight
        | Noun               flights
VP      → Verb               do
        | Verb NP            want + a flight
        | Verb NP PP         leave + Boston + in the morning
        | Verb PP            leaving + on Thursday
PP      → Preposition NP     from + Los Angeles

Figure 9.3  The grammar for L0, with example phrases for each rule.

A CFG like that of L0 defines a formal language. We saw in Chapter 2
that a formal language is a set of strings. Sentences (strings of words) that
can be derived by a grammar are in the formal language defined by that
grammar, and are called grammatical sentences. Sentences that cannot be
derived by a given formal grammar are not in the language defined by that
grammar, and are referred to as ungrammatical. This hard line between
‘in’ and ‘out’ characterizes all formal languages but is only a very simplified
model of how natural languages really work. This is because determining
whether a given sentence is part of a given natural language (say English)
often depends on the context. In linguistics, the use of formal languages to
model natural languages is called generative grammar, since the language
is defined by the set of possible sentences ‘generated’ by the grammar.

We conclude this section by way of summary with a quick formal
description of a context-free grammar and the language it generates. A
context-free grammar has four parameters (technically ‘is a 4-tuple’):

1.  a set of non-terminal symbols (or ‘variables’) $N$

2.  a set of terminal symbols $\Sigma$ (disjoint from $N$)

3.  a set of productions $P$, each of the form $A \rightarrow \alpha$, where $A$ is a
non-terminal and $\alpha$ is a string of symbols from the infinite set of strings
$(\Sigma \cup N)^*$.

4.  a designated start symbol $S$

A language is defined via the concept of derivation. One string
derives another one if it can be rewritten as the second one via some series of
rule applications. More formally, following Hopcroft and Ullman (1979), if
$A \rightarrow \beta$ is a production of $P$ and $\alpha$ and $\gamma$ are any strings in the set
$(\Sigma \cup N)^*$, then we say that $\alpha A \gamma$ directly derives $\alpha \beta \gamma$, or
$\alpha A \gamma \Rightarrow \alpha \beta \gamma$. Derivation is then a generalization of direct derivation.
Let $\alpha_1, \alpha_2, \ldots, \alpha_m$ be strings in $(\Sigma \cup N)^*$, $m \geq 1$, such that

$\alpha_1 \Rightarrow \alpha_2, \alpha_2 \Rightarrow \alpha_3, \ldots, \alpha_{m-1} \Rightarrow \alpha_m$    (9.8)

We say that $\alpha_1$ derives $\alpha_m$, or $\alpha_1 \stackrel{*}{\Rightarrow} \alpha_m$.

We can then formally define the language $\mathcal{L}_G$ generated by a grammar
$G$ as the set of strings composed of terminal symbols which can be derived
from the designated start symbol $S$.

$\mathcal{L}_G = \{ w \mid w \text{ is in } \Sigma^* \text{ and } S \stackrel{*}{\Rightarrow} w \}$    (9.9)
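
The formal definitions translate fairly directly into code. The sketch below is an assumption-laden illustration, not a parser: it encodes only the small noun phrase fragment used earlier for the derivation of a flight, represents strings of symbols as Python tuples, and checks derivability by a bounded breadth-first search (the names `direct_derivations`, `derives`, and the `max_steps` bound are hypothetical).

```python
# A fragment of the chapter's grammar written as the 4-tuple (N, Sigma, P, S).
N = {"NP", "Det", "Nominal", "Noun"}
SIGMA = {"a", "flight"}
P = [("NP", ("Det", "Nominal")),
     ("Nominal", ("Noun",)),
     ("Det", ("a",)),
     ("Noun", ("flight",))]
S = "NP"   # start symbol of this fragment (assumption; the full grammar uses S)

def direct_derivations(form):
    """All strings that `form` directly derives: rewrite one nonterminal once."""
    results = set()
    for i, symbol in enumerate(form):
        for lhs, rhs in P:
            if symbol == lhs:
                results.add(form[:i] + rhs + form[i + 1:])
    return results

def derives(source, target, max_steps=10):
    """Check source =>* target by breadth-first search over rule applications."""
    frontier, seen = {source}, {source}
    for _ in range(max_steps + 1):
        if target in frontier:
            return True
        frontier = {g for f in frontier for g in direct_derivations(f)} - seen
        seen |= frontier
    return False

print(direct_derivations((S,)))          # {('Det', 'Nominal')}
print(derives((S,), ("a", "flight")))    # True
```

Blind search of this kind is only workable for toy grammars; it is meant to make the definition of derivation concrete, not to suggest a practical algorithm.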

The problem of mapping from a string of words to its parse tree is
called parsing; we will define algorithms for parsing in Chapter 10 and in
Chapter 12.

9.3  Sentence-Level  Constructions 

The remainder of this chapter will introduce a few of the more complex
aspects of the phrase structure of English; for consistency we will continue
to focus on sentences from the ATIS domain. Because of space limitations,
our discussion will necessarily be limited to highlights. Readers are strongly
advised to consult Quirk et al. (1985a), which is by far the best current
reference grammar of English.

In the small grammar L0, we only gave a single sentence-level
construction for declarative sentences like I prefer a morning flight. There are
a great number of possible overall sentence structures, but four are particularly
common and important: declarative structure, imperative structure,
yes-no-question structure, and wh-question structure.

Sentences with declarative structure have a subject noun phrase
followed by a verb phrase, like ‘I prefer a morning flight’. Sentences with this
structure have a great number of different uses that we will follow up on in
Chapter 19. Here are a number of examples from the ATIS domain:

The flight should be eleven a.m tomorrow
I need a flight to Seattle leaving from Baltimore making a stop in Minneapolis
The return flight should leave at around seven p.m
I would like to find out the flight number for the United flight that arrives in San Jose around ten p.m
I’d like to fly the coach discount class
I want a flight from Ontario to Chicago
I plan to leave on July first around six thirty in the evening


Sentences with imperative structure often begin with a verb phrase,
and have no subject. They are called imperative because they are almost
always used for commands and suggestions; in the ATIS domain they are
commands to the system.

Show the lowest fare
Show me the cheapest fare that has lunch
Give me Sunday’s flights arriving in Las Vegas from Memphis and New York City
List all flights between five and seven p.m
List all flights from Burbank to Denver
Show me all flights that depart before ten a.m and have first class fares
Show me all the flights leaving Baltimore
Show me flights arriving within thirty minutes of each other
Please list the flights from Charlotte to Long Beach arriving after lunch time
Show me the last flight to leave

To  model  this  kind  of  sentence  structure,  we  can  add  another  rule  for  the 
expansion  of  S: 

S → VP        Show the lowest fare

Sentences with yes-no-question structure are often (though not
always) used to ask questions (hence the name), and begin with an auxiliary
verb, followed by a subject NP, followed by a VP. Here are some examples
(note that the third example is not really a question but a command or
suggestion; Chapter 19 will discuss the pragmatic uses of these question
forms):

Do  any  of  these  flights  have  stops? 

Does  American’s  flight  eighteen  twenty  five  serve  dinner? 

Can  you  give  me  the  same  information  for  United? 

Here’s  the  rule: 

S → Aux NP VP

The most complex of the sentence-level structures we will examine
are the various wh- structures. These are so named because one of their
constituents is a wh- phrase, i.e. one that includes a wh- word (who, where,
what, which, how, why). These may be broadly grouped into two classes of
sentence-level structures. The wh-subject-question structure is identical to
the declarative structure, except that the first noun phrase contains some
wh-word.


What  airlines  fly  from  Burbank  to  Denver? 

Which flights depart Burbank after noon and arrive in Denver by six p.m?

Which  flights  serve  breakfast? 

Which  of  these  flights  have  the  longest  layover  in  Nashville? 


Here is a rule. Exercise 9.10 discusses rules for the constituents that make
up the Wh-NP.

S → Wh-NP VP

In the wh-non-subject-question structure, the wh-phrase is not the
subject of the sentence, and so the sentence includes another subject. In
these types of sentences the auxiliary appears before the subject NP, just as
in the yes-no-question structures. Here is an example:

What flights do you have from Burbank to Tacoma Washington?

Here is a sample rule:

S → Wh-NP Aux NP VP

There  are  other  sentence-level  structures  we  won’t  try  to  model  here, 
like  fronting,  in  which  a phrase  is  placed  at  the  beginning  of  the  sentence  for 
various  discourse  purposes  (for  example  often  involving  topicalization  and 
focus): 

On  Tuesday,  I’d  like  to  fly  from  Detroit  to  Saint  Petersburg 


9.4  The Noun Phrase

We can view the noun phrase as revolving around a head, the central noun
in the noun phrase. The syntax of English allows for both prenominal
(pre-head) modifiers and post-nominal (post-head) modifiers.

Before the Head Noun

We have already discussed some of the parts of the noun phrase: the
determiner, and the use of the Nominal constituent for representing double noun
phrases. We have seen that noun phrases can begin with a determiner, as
follows:

a stop 
the  flights 
that  fare 
this  flight 
those  flights 
any  flights 
some  flights 

There are certain circumstances under which determiners are optional
in English. For example, determiners may be omitted if the noun they modify
is plural:

Show  me  flights  from  San  Francisco  to  Denver  on  weekdays 

As we saw in Chapter 8, mass nouns don’t require determination.
Recall that mass nouns often (not always) involve something that is treated like
a substance (including e.g. water and snow), don’t take the indefinite article
‘a’, and don’t tend to pluralize. Many abstract nouns are mass nouns (music,
homework). Mass nouns in the ATIS domain include breakfast, lunch, and
dinner.

Does  this  flight  serve  dinner? 

Exercise  9.4  asks  the  reader  to  represent  this  fact  in  the  CFG  formalism. 

Word classes that appear in the NP before the determiner are called
predeterminers. Many of these have to do with number or amount; a
common predeterminer is all:

all  the  flights 
all  flights 

A number of different kinds of word classes can appear in the NP
between the determiner and the head noun (the ‘post-determiners’). These
include cardinal numbers, ordinal numbers, and quantifiers. Examples
of cardinal numbers:

two friends
one stop

Ordinal numbers include first, second, third, etc., but also words like
next, last, past, other, and another.

the  first  one 

the  next  day 

the  second  leg 

the  last  flight 

the  other  American  flight 

any  other  fares 

Some quantifiers (many, (a) few, several) occur only with plural count nouns:

many fares

The  quantifiers  much  and  a little  occur  only  with  noncount  nouns. 

Adjectives  occur  after  quantifiers  but  before  nouns. 

a first-class fare
a nonstop  flight 
the  longest  layover 
the  earliest  lunch  flight 

Adjectives can also be grouped into a phrase called an adjective phrase
or AP. APs can have an adverb before the adjective (see Chapter 8 for
definitions of adjectives and adverbs):

the  least  expensive  fare 

We  can  combine  all  the  options  for  prenominal  modifiers  with  one  rule  as 
follows: 

NP → (Det) (Card) (Ord) (Quant) (AP) Nominal    (9.10)

This simplified noun phrase rule has a flatter structure and hence is
simpler than would be assumed by most modern theories of grammar. We present
this simplified rule because there is no universally agreed-upon internal
constituency for the noun phrase.

Note the use of parentheses () to mark optional constituents. A rule
with one set of parentheses is really a shorthand for two rules, one with the
parenthesized constituent, one without.
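
Since each set of parentheses stands for two rules, a rule with five optional constituents such as the prenominal NP rule is shorthand for 2^5 = 32 plain rules. The following sketch makes that expansion explicit; the function name `expand_optional` and the convention of writing optional symbols as strings like "(Det)" are assumptions made for the example.

```python
from itertools import product

def expand_optional(lhs, rhs):
    """Expand a rule whose optional symbols are written '(X)' into plain rules."""
    choices = []
    for sym in rhs:
        if sym.startswith("(") and sym.endswith(")"):
            choices.append([sym[1:-1], None])    # keep the symbol or drop it
        else:
            choices.append([sym])                # obligatory symbol
    rules = []
    for combo in product(*choices):
        rules.append((lhs, [s for s in combo if s is not None]))
    return rules

rules = expand_optional(
    "NP", ["(Det)", "(Card)", "(Ord)", "(Quant)", "(AP)", "Nominal"])
print(len(rules))    # 32
print(rules[0])      # ('NP', ['Det', 'Card', 'Ord', 'Quant', 'AP', 'Nominal'])
print(rules[-1])     # ('NP', ['Nominal'])
```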

After the Noun

A head noun can be followed by postmodifiers. Three kinds of nominal
postmodifiers are very common in English:

prepositional phrases    all flights from Cleveland
non-finite clauses       any flights arriving after eleven a.m.
relative clauses         a flight that serves breakfast

Prepositional phrase postmodifiers are particularly common in the ATIS
corpus, since they are used to mark the origin and destination of flights. Here
are some examples, with brackets inserted to show the boundaries of each
PP; note that more than one PP can be strung together:

any stopovers [for Delta seven fifty one]
all flights [from Cleveland] [to Newark]
arrival [in San Jose] [before seven p.m]
a reservation [on flight six oh six] [from Tampa] [to Montreal]


Here’s a new NP rule to account for one to three PP postmodifiers:

Nominal → Nominal PP (PP) (PP)

The three most common kinds of non-finite postmodifiers are the
gerundive (-ing), -ed, and infinitive forms.

Gerundive postmodifiers are so-called because they consist of a verb
phrase that begins with the gerundive (-ing) form of the verb. In the
following examples, the verb phrases happen to all have only prepositional phrases
after the verb, but in general this verb phrase can have anything in it
(anything, that is, which is semantically and syntactically compatible with the
gerund verb).

any of those (leaving on Thursday)
any flights (arriving after eleven a.m)
flights (arriving within thirty minutes of each other)


We can define the NP as follows, making use of a new nonterminal GerundVP:

Nominal → Nominal GerundVP

We can make rules for GerundVP constituents by duplicating all of our VP
productions, substituting GerundV for V.

GerundVP → GerundV NP
         | GerundV PP
         | GerundV
         | GerundV NP PP

GerundV can then be defined as:

GerundV → being | preferring | arriving | leaving | ...


The  phrases  in  italics  below  are  examples  of  the  two  other  common  kinds  of 
non-finite  clauses,  infinitives  and  -ed  forms: 

the  last  flight  to  arrive  in  Boston 

I need  to  have  dinner  served 

Which  is  the  aircraft  used  by  this  flight? 

A postnominal relative clause (more correctly a restrictive relative
clause) is a clause that often begins with a relative pronoun (that and who
are the most common). The relative pronoun functions as the subject of the
embedded verb in the following examples:

a flight that serves breakfast
flights that leave in the morning
the United flight that arrives in San Jose around ten p.m.
the one that leaves at ten thirty five

We  might  add  rules  like  the  following  to  deal  with  these: 


Nominal → Nominal RelClause    (9.11)

RelClause → (who | that) VP    (9.12)

The relative pronoun may also function as the object of the embedded
verb, as in the following example; we leave writing grammar rules for more
complex relative clauses of this kind as an exercise for the reader.

the earliest American Airlines flight that I can get

Various  postnominal  modifiers  can  be  combined,  as  the  following  examples 
show: 


a flight (from Phoenix to Detroit) (leaving Monday evening)
I need a flight (to Seattle) (leaving from Baltimore) (making a stop in Minneapolis)
evening flights (from Nashville to Houston) (that serve dinner)
a friend (living in Denver) (that would like to visit me here in Washington DC)


9.5  Coordination 

Noun phrases and other units can be conjoined with conjunctions like and,
or, and but. For example a coordinate noun phrase can consist of two other
noun phrases separated by a conjunction (we used brackets to mark the
constituents):

Please repeat [NP [NP the flights] and [NP the costs]]
I need to know [NP [NP the aircraft] and [NP flight number]]
I would like to fly from Denver stopping in [NP [NP Pittsburgh] and [NP Atlanta]]

Here’s a new rule for this:

NP → NP and NP    (9.14)

In addition to NPs, most other kinds of phrases can be conjoined (for
example including sentences, VPs, and PPs):

What flights do you have [VP [VP leaving Denver] and [VP arriving in San Francisco]]
[S [S I'm interested in a flight from Dallas to Washington] and [S I'm also interested in going to Baltimore]]

Similar conjunction rules can be built for VP and S conjunction:

VP → VP and VP    (9.15)
S → S and S       (9.16)


9.6  Agreement

In Chapter 3 we discussed English inflectional morphology. Recall that most
verbs in English can appear in two forms in the present tense: the form used
for third-person, singular subjects (the flight does), and the form used for all
other kinds of subjects (all the flights do, I do). The third-person-singular
(3sg) form usually has a final -s where the non-3sg form does not. Here are
some examples, again using the verb do, with various subjects:

Do [NP any flights] stop in Chicago?
Do [NP all of these flights] offer first class service?
Do [NP I] get dinner on this flight?
Do [NP you] have a flight from Boston to Fort Worth?
Does [NP this flight] stop in Dallas?
Does [NP that flight] serve dinner?
Does [NP Delta] fly from Atlanta to San Francisco?

Here  are  more  examples  with  the  verb  leave: 

What  flights  leave  in  the  morning? 

What  flight  leaves  from  Pittsburgh? 

This agreement phenomenon occurs whenever there is a verb that has
some noun acting as its subject. Note that sentences in which the subject
does not agree with the verb are ungrammatical:

*[What flight] leave in the morning?
*Does [NP you] have a flight from Boston to Fort Worth?
*Do [NP this flight] stop in Dallas?

How can we modify our grammar to handle these agreement
phenomena? One way is to expand our grammar with multiple sets of rules, one rule
set for 3sg subjects, and one for non-3sg subjects. For example, the rule that
handled these yes-no-questions used to look like this:

S → Aux NP VP

We could replace this with two rules of the following form:

S → 3sgAux 3sgNP VP
S → Non3sgAux Non3sgNP VP

We could then add rules for the lexicon like these:

3sgAux → does | has | can | ...
Non3sgAux → do | have | can | ...

But we would also need to add rules for 3sgNP and Non3sgNP, again
by making two copies of each rule for NP. While pronouns can be first,
second, or third person, full lexical noun phrases can only be third person,
so for them we just need to distinguish between singular and plural:

3sgNP      → (Det) (Card) (Ord) (Quant) (AP) SgNominal
Non3sgNP   → (Det) (Card) (Ord) (Quant) (AP) PlNominal
SgNominal  → SgNoun | SgNoun SgNoun
PlNominal  → PlNoun | SgNoun PlNoun
SgNoun     → flight | fare | dollar | reservation | ...
PlNoun     → flights | fares | dollars | reservations | ...


Dealing  with  the  first  and  second  person  pronouns  is  left  as  an  exercise  for 
the  reader. 

A problem with this method of dealing with number agreement is that
it doubles the size of the grammar. Every rule that refers to a noun or a verb
needs to have a ‘singular’ version and a ‘plural’ version. This rule
proliferation will also have to happen for the noun’s case; for example English
pronouns have nominative (I, she, he, they) and accusative (me, her, him,
them) versions. We will need new versions of every NP and N rule for each
of these.

A more significant problem occurs in languages like German or French,
which not only have noun-verb agreement like English, but also have gender
agreement; the gender of a noun must agree with the gender of its
modifying adjective and determiner. This adds another multiplier to the rule sets of
the language.
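
The multiplication of rules can be made concrete with a small sketch. It rests on simplifying assumptions that go beyond the text: every symbol in a rule is taken to carry the same agreement features, each copy is named by prefixing the feature values to the symbol names, and the three-valued gender feature stands in for a language like German. The function name `specialize` is hypothetical.

```python
from itertools import product

def specialize(lhs, rhs, features):
    """Copy one rule once per combination of agreement feature values."""
    copies = []
    for values in product(*features.values()):
        tag = "".join(v.capitalize() for v in values)   # e.g. 'Sg', 'PlFem'
        copies.append((tag + lhs, [tag + sym for sym in rhs]))
    return copies

number = {"number": ["sg", "pl"]}
number_gender = {"number": ["sg", "pl"], "gender": ["masc", "fem", "neut"]}

print(len(specialize("NP", ["Det", "Nominal"], number)))          # 2
print(len(specialize("NP", ["Det", "Nominal"], number_gender)))   # 6
print(specialize("NP", ["Det", "Nominal"], number)[0])
# ('SgNP', ['SgDet', 'SgNominal'])
```

Each additional feature multiplies the number of copies by the number of values it can take, which is the growth the paragraphs above describe.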

Chapter 11 will introduce a way to deal with these agreement problems
without exploding the size of the grammar, by effectively parameterizing
each nonterminal of the grammar with feature structures.

9.7  The  Verb  Phrase  and  Subcategorization 


The verb phrase consists of the verb and a number of other constituents. In
the simple rules we have built so far, these other constituents include NPs
and PPs and combinations of the two:

VP → Verb          disappear
VP → Verb NP       prefer a morning flight
VP → Verb NP PP    leave Boston in the morning
VP → Verb PP       leaving on Thursday


Verb phrases can be significantly more complicated than this. Many
other kinds of constituents can follow the verb, such as an entire embedded
sentence. These are called sentential complements:

You [VP [V said] [S there were two flights that were the cheapest]]
You [VP [V said] [S you had a two hundred sixty six dollar fare]]
[VP [V Tell] [NP me] [S how to get from the airport in Philadelphia to downtown]]
I [VP [V think] [S I would like to take the nine thirty flight]]

Here’s a rule for these:

VP → Verb S

Another potential constituent of the VP is another VP. This is often the
case for verbs like want, would like, try, intend, need:

I want [VP to fly from Milwaukee to Orlando]
Hi, I want [VP to arrange three flights]
Hello, I'm trying [VP to find a flight that goes from Pittsburgh to Denver after two PM]

Recall from Chapter 8 that verbs can also be followed by particles,
words that resemble a preposition but that combine with the verb to form a
phrasal verb (like take off). These particles are generally considered to be
an integral part of the verb in a way that other post-verbal elements are not;
phrasal verbs are treated as individual verbs composed of two words.

While a verb phrase can have many possible kinds of constituents, not
every verb is compatible with every verb phrase. For example, the verb want
can either be used with an NP complement (I want a flight...), or with an
infinitive VP complement (I want to fly to...). By contrast, a verb like find
cannot take this sort of VP complement (*I found to fly to Dallas).

This idea that verbs are compatible with different kinds of
complements is a very old one; traditional grammar distinguishes between
transitive verbs like find, which take a direct object NP (I found a flight), and
intransitive verbs like disappear, which do not (*I disappeared a flight).

Where traditional grammars subcategorize verbs into these two
categories (transitive and intransitive), modern grammars distinguish as many as
100 subcategories. (In fact tagsets for many such subcategorization frames
exist; see Macleod et al. (1998) for the COMLEX tagset, Sanfilippo (1993)
for the ACQUILEX tagset, and further discussion in Chapter 11.) We say
that a verb like find subcategorizes for an NP, while a verb like want
subcategorizes for either an NP or an infinitive VP. We also call these constituents
the complements of the verb (hence our use of the term sentential
complement above). So we say that want can take a VP complement. These
possible sets of complements are called the subcategorization frame for the
verb. Another way of talking about the relation between the verb and these
other constituents is to think of the verb as a predicate and the constituents
as arguments of the predicate. So we can think of such predicate-argument
relations as FIND(I, A FLIGHT), or WANT(I, TO FLY). We will talk more
about this view of verbs and arguments in Chapter 14 when we talk about
predicate calculus representations of verb semantics.

Here are some subcategorization frames and example verbs:

Frame          Verb                  Example
∅              eat, sleep            I want to eat
NP             prefer, find, leave   Find the flight from Pittsburgh to Boston
NP NP          show, give            Show me airlines with flights from Pittsburgh
PPfrom PPto    fly, travel           I would like to fly from Boston to Philadelphia
NP PPwith      help, load            Can you help [NP me] [PP with a flight]
VPto           prefer, want, need    I would prefer [VPto to go by United airlines]
VPbrst         can, would, might     I can [VPbrst go from Boston]
S              mean                  Does this mean [S AA has a hub in Boston]?


Note that a verb can subcategorize for a particular type of verb phrase,
such as a verb phrase whose verb is an infinitive (VPto), or a verb phrase
whose verb is a bare stem (uninflected: VPbrst). Note also that a single verb
can take different subcategorization frames. The verb find, for example, can
take an NP NP frame (find me a flight) as well as an NP frame.

How can we represent the relation between verbs and their
complements in a context-free grammar? One thing we could do is to do what we
did with agreement features: make separate subtypes of the class Verb
(Verb-with-NP-complement, Verb-with-Inf-VP-complement, Verb-with-S-complement,
Verb-with-NP-plus-PP-complement, and so on):

Verb-with-NP-complement       → find | leave | repeat | ...
Verb-with-S-complement        → think | believe | say | ...
Verb-with-Inf-VP-complement   → want | try | need | ...


Then each of our VP rules could be modified to require the appropriate verb
subtype:

VP → Verb-with-no-complement    disappear
VP → Verb-with-NP-comp NP       prefer a morning flight
VP → Verb-with-S-comp S         said there were two flights

The problem with this approach, as with the same solution to the
agreement feature problem, is a vast explosion in the number of rules. The
standard solution to both of these problems is the feature structure, which will
be introduced in Chapter 11. Chapter 11 will also discuss the fact that nouns,
adjectives, and prepositions can subcategorize for complements just as verbs
can.
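
An alternative to multiplying verb subtypes, sketched here only as an illustration and not as the feature-structure mechanism of Chapter 11, is to record each verb's subcategorization frames in the lexicon and check a candidate verb phrase against them. The dictionary `SUBCAT`, the function `licenses`, and the encoding of frames as tuples are assumptions; the entries are taken from the frames and examples discussed in this section and are not meant to be exhaustive.

```python
# Each verb maps to the set of frames it accepts; () is the empty frame.
SUBCAT = {
    "eat": {()},
    "sleep": {()},
    "disappear": {()},
    "find": {("NP",), ("NP", "NP")},
    "prefer": {("NP",), ("VPto",)},
    "want": {("NP",), ("VPto",)},
    "show": {("NP", "NP")},
    "fly": {("PPfrom", "PPto")},
    "mean": {("S",)},
}

def licenses(verb, complements):
    """True if `verb` is listed with exactly this sequence of complements."""
    return tuple(complements) in SUBCAT.get(verb, set())

print(licenses("want", ["VPto"]))       # True:  I want to fly to ...
print(licenses("find", ["VPto"]))       # False: *I found to fly to Dallas
print(licenses("find", ["NP", "NP"]))   # True:  find me a flight
print(licenses("disappear", ["NP"]))    # False: *I disappeared a flight
```

Keeping the frames in the lexicon leaves a single VP rule per frame instead of one rule per verb subtype, at the cost of a check that lies outside the context-free rules themselves.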


9.8  Auxiliaries 

The subclass of verbs called auxiliaries or helping verbs has particular
syntactic constraints which can be viewed as a kind of subcategorization.
Auxiliaries include the modal verbs can, could, may, might, must, will,
would, shall, and should, the perfect auxiliary have, the progressive
auxiliary be, and the passive auxiliary be. Each of these verbs places a constraint
on the form of the following verb, and each of these must also combine in a
particular order.

Modal  verbs  subcategorize  for  a VP  whose  head  verb  is  a bare  stem, 
e.g.  can  go  in  the  morning,  will  try  to  find  a flight.  The  perfect  verb  have 
subcategorizes  for  a VP  whose  head  verb  is  the  past  participle  form:  have 
booked  3 flights.  The  progressive  verb  be  subcategorizes  for  a VP  whose 
head  verb  is  the  gerundive  participle:  am  going  from  Atlanta.  The  passive 
verb be subcategorizes for a VP whose head verb is the past participle: was
delayed by inclement weather.

A sentence can have multiple auxiliary verbs, but they must occur in a
particular order:

modal  < perfect  < progressive  < passive 

Here  are  some  examples  of  multiple  auxiliaries: 

modal perfect              could have been a contender
modal passive              will be married
perfect progressive        have been feasting
modal perfect passive      might have been prevented

Auxiliaries are often treated just like verbs such as want, seem, or
intend, which subcategorize for particular kinds of VP complements. Thus
can would be listed in the lexicon as a verb-with-bare-stem-VP-complement.

One way of capturing the ordering constraints among auxiliaries, commonly
used in the systemic grammar of Halliday (1985a), is to introduce a special
constituent called the verb group, whose subconstituents include all the
auxiliaries as well as the main verb. Some of the ordering constraints can also
be captured in a different way. Since modals, for example, do not have
a progressive or participle form, they simply will never be allowed to follow
progressive or passive be or perfect have. Exercise 9.8 asks the reader to
write grammar rules for auxiliaries.
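The two kinds of constraint described in this section (the fixed order modal < perfect < progressive < passive, and the verb form each auxiliary requires of the verb that follows it) can be checked with a small sketch. The form labels "bare", "pastpart", "gerund" and "finite", and the function name, are assumptions chosen for this illustration; it is not a solution to Exercise 9.8, which asks for grammar rules.

```python
# Order in which auxiliary types may appear (from the text).
ORDER = ["modal", "perfect", "progressive", "passive"]

# Form each auxiliary type requires of the verb that follows it.
# The form labels are assumptions made for this sketch.
REQUIRES = {
    "modal": "bare",
    "perfect": "pastpart",
    "progressive": "gerund",
    "passive": "pastpart",
}


def check_verb_group(verbs):
    """verbs: list of (kind, form) pairs, where kind is an auxiliary type
    or 'main'. Returns True if order and form constraints are satisfied."""
    if not verbs or verbs[-1][0] != "main":
        return False
    for (kind, _), (next_kind, next_form) in zip(verbs, verbs[1:]):
        if kind == "main":
            return False  # nothing may follow the main verb
        if next_kind != "main" and ORDER.index(next_kind) <= ORDER.index(kind):
            return False  # auxiliaries out of order
        if next_form != REQUIRES[kind]:
            return False  # wrong form after this auxiliary
    return True


# "might have been prevented": modal, perfect, passive, main verb
print(check_verb_group([("modal", "finite"), ("perfect", "bare"),
                        ("passive", "pastpart"), ("main", "pastpart")]))
```

Note that the order check is partly redundant with the form check, as the text observes: a modal has no participle form, so it could never satisfy the form required after have or be.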

The passive construction has a number of properties that make it different
from other auxiliaries. One important difference is a semantic one: while
the subject of a non-passive (active) sentence is often the semantic agent of
the event described by the verb (I prevented a catastrophe), the subject of
the passive is often the undergoer or patient of the event (a catastrophe was
prevented). This will be discussed further in Chapter 15.

9.9  Spoken  Language  Syntax 

The grammar of written English and the grammar of conversational spoken
English  share  many  features,  but  also  differ  in  a number  of  respects.  This 
section  gives  a quick  sketch  of  a number  of  the  characteristics  of  the  syntax 
of  spoken  English. 

We usually use the term utterance rather than sentence for the units
of  spoken  language.  Figure  9.5  shows  some  sample  spoken  ATIS  utterances 
that  exhibit  many  aspects  of  spoken  language  grammar. 

This is a standard style of transcription used in transcribing speech
corpora for speech recognition. The comma ‘,’ marks a short pause, each
period ‘.’ marks a long pause, and the square brackets ‘[uh]’ mark
non-verbal events (breaths, lipsmacks, uhs and ums).

the . [exhale] . . . [inhale] . . [uh] does American airlines . offer any
. one way flights . [uh] one way fares, for one hundred and sixty one
dollars

[mm] i'd like to leave i guess between [um] . [smack] . five o’clock no,
five o’clock and [uh], seven o’clock . P M
around, four, P M

all right, [throat_clear] . . i'd like to know the . give me the flight . times
. in the morning . for September twentieth . nineteen ninety one
[uh] one way
[uh] seven fifteen, please

on United airlines . . give me, the . . time . . from New York . [smack]
. to Boise-, to . I’m sorry . on United airlines . [uh] give me the flight,
numbers, the flight times from . [uh] Boston , to Dallas

Figure 9.5 Some sample spoken utterances from users interacting with the
ATIS system.

There are a number of ways these utterances differ from written English
sentences. One is in the lexical statistics; for example, spoken English
is much higher in pronouns than written English; the subject of a spoken
sentence is almost invariably a pronoun. Another is in the presence of
various kinds of disfluencies (hesitations, repairs, restarts, etc.) to be
discussed below. Spoken sentences also often consist of short fragments or
phrases (one way or around four PM), which are less common in written English.

Finally,  these  sentences  were  spoken  with  a particular  prosody.  The 
prosody  of  an  utterance  includes  its  particular  pitch  contour  (the  rise  and 
fall  of  the  fundamental  frequency  of  the  soundwave),  its  stress  pattern  or 
rhythm (the series of stressed and unstressed syllables that make up a
sentence), and other similar factors like the rate (speed) of speech.


Disfluencies 

Perhaps the most salient syntactic feature that distinguishes spoken and
written language is the class of phenomena known as disfluencies. Disfluencies
include the use of uh and um, word repetitions, and false starts. The ATIS
sentence in Figure 9.6 shows examples of a false start and the use of uh. The
false start here occurs when the speaker starts by asking for one-way flights,
and then stops and corrects herself, beginning again and asking about
one-way fares.


Does American airlines offer any one-way flights ^ [uh] one-way fares for 160 dollars?

   Reparandum:          one-way flights
   Interruption point:  ^ (right after flights)
   Interregnum:         [uh]
   Repair:              one-way fares

Figure 9.6 An example of a disfluency (after Shriberg (1994)).


The segment one-way flights is referred to as the reparandum, and the
replacing sequence one-way fares is referred to as the repair (these terms are
from Levelt (1983)). The interruption point, where the speaker breaks off
the original word sequence, here occurs right after the word flights.

The words uh and um (sometimes called filled pauses) can be treated in
the lexicon like regular words, and indeed this is often how they are modeled
in speech recognition. The HMM pronunciation lexicons in speech recognizers
often include pronunciation models of these words, and the N-gram
grammars used by recognizers include the probabilities of these occurring
with other words.

For speech understanding, where our goal is to build a meaning for the
input sentence, it may be useful to detect these restarts in order to edit out
what the speaker probably considered the ‘corrected’ words. For example in
the sentence above, if we could detect that there was a restart, we could just
delete the reparandum (shown in braces below, together with the filled pause),
and parse the remaining parts of the sentence:

Does American airlines offer any {one-way flights uh} one-way fares
for 160 dollars?

How do disfluencies interact with the constituent structure of the
sentence? Hindle (1983) showed that the repair often has the same structure
as the constituent just before the interruption point. Thus in the example
above, the repair is an NP, as is the reparandum. This means that if it is
possible to automatically find the interruption point, it is also often
possible to automatically detect the boundaries of the reparandum.
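The editing step can be sketched as follows, assuming the interruption point is already known. The heuristic used here, matching the first word of the repair against the words before the interruption point, is a simplifying assumption standing in for the structural match that Hindle describes; the function name and the token-index convention are likewise illustrative.

```python
FILLED_PAUSES = {"uh", "um"}


def edit_repair(tokens, ip):
    """Delete the reparandum and any filled pauses from a token list.

    tokens: list of words; ip: index of the first token after the
    interruption point. The reparandum is guessed by word overlap:
    it is assumed to start at the most recent earlier occurrence of
    the first word of the repair (an assumption of this sketch).
    """
    start = ip
    while start < len(tokens) and tokens[start] in FILLED_PAUSES:
        start += 1  # skip the interregnum (filled pauses)
    if start == len(tokens):
        return tokens[:ip]  # nothing follows the interruption point
    first_repair_word = tokens[start]
    for i in range(ip - 1, -1, -1):
        if tokens[i] == first_repair_word:
            return tokens[:i] + tokens[start:]
    # No overlapping word found: drop only the filled pauses.
    return tokens[:ip] + tokens[start:]


utterance = ("does american airlines offer any one-way flights "
             "uh one-way fares for 160 dollars").split()
print(" ".join(edit_repair(utterance, 7)))
```

Here the repair begins with one-way, so the sketch deletes everything from the earlier one-way through the filled pause, leaving a sentence an ordinary parser could handle.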


9.10  Grammar  Equivalence  & Normal  Form 

A formal language is defined as a (possibly infinite) set of strings of words.
This suggests that we could ask if two grammars are equivalent by asking if
they generate the same set of strings. In fact it is possible to have two distinct
context-free grammars generate the same language.

We usually distinguish two kinds of grammar equivalence: weak equivalence
and strong equivalence. Two grammars are strongly equivalent if
they generate the same set of strings and if they assign the same phrase
structure to each sentence (allowing merely for renaming of the non-terminal
symbols). Two grammars are weakly equivalent if they generate the same
set of strings but do not assign the same phrase structure to each sentence.
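As a small illustration of weak equivalence, the sketch below enumerates the strings of two toy grammars up to a length bound and compares them. The grammars (one right-branching, one left-branching) and the function name are assumptions made for the example, and the length-based pruning is only sound for grammars without empty productions.

```python
def language(grammar, start, max_len):
    """All terminal strings of at most max_len words derivable from start.

    grammar maps each non-terminal to a list of right-hand sides (lists of
    symbols). Assumes no empty productions, so a sentential form longer
    than max_len can never shrink and is safely discarded.
    """
    seen, agenda, strings = set(), [(start,)], set()
    while agenda:
        form = agenda.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        # Find the leftmost non-terminal, if any.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            strings.add(" ".join(form))
            continue
        for rhs in grammar[form[idx]]:
            agenda.append(form[:idx] + tuple(rhs) + form[idx + 1:])
    return strings


RIGHT = {"S": [["a", "S"], ["a"]]}  # S -> a S | a   (right-branching trees)
LEFT = {"S": [["S", "a"], ["a"]]}   # S -> S a | a   (left-branching trees)

print(language(RIGHT, "S", 4) == language(LEFT, "S", 4))
```

The two grammars agree on every string up to the bound, yet they assign different trees to any string of two or more words, so they are weakly but not strongly equivalent. A bounded comparison like this can only suggest equivalence; it does not prove it.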

It is sometimes useful to have a normal form for grammars, in which
each of the productions takes a particular form. For example a context-free
grammar is in Chomsky normal form (CNF) (Chomsky, 1963) if it is ε-free
and if in addition each production is either of the form A → B C or A → a.
That is, the righthand side of each rule either has two non-terminal symbols
or one terminal symbol. Chomsky normal form grammars have binary trees
(down to the prelexical nodes), which can be useful for certain algorithms.

Any grammar can be converted into a weakly-equivalent Chomsky normal
form grammar. For example a rule of the form

A → B C D

can be converted into the following two CNF rules:

A → B X
X → C D

Exercise 9.11 asks the reader to formulate the complete algorithm.
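The single step shown above, splitting a long right-hand side into binary rules, can be sketched as follows. This covers only that step; a full conversion must also deal with unit productions, empty productions, and rules that mix terminals with non-terminals. The fresh symbol names X1, X2, ... are an assumption of the sketch and are presumed not to clash with existing symbols.

```python
def binarize(rules):
    """Split every rule whose right-hand side has more than two symbols.

    rules: list of (lhs, rhs) pairs, with rhs a tuple of symbols.
    Returns a new rule list in which no right-hand side exceeds length 2.
    """
    out = []
    counter = 0
    for lhs, rhs in rules:
        while len(rhs) > 2:
            counter += 1
            new_symbol = "X" + str(counter)  # hypothetical fresh non-terminal
            out.append((lhs, (rhs[0], new_symbol)))
            lhs, rhs = new_symbol, rhs[1:]
        out.append((lhs, rhs))
    return out


for lhs, rhs in binarize([("A", ("B", "C", "D"))]):
    print(lhs, "->", " ".join(rhs))
```

Because each new symbol is used in exactly one place, the binarized grammar generates the same strings as the original, though with extra nodes in its trees.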


9.11 Finite State & Context-Free Grammars

We argued in Section 9.1 that a complex model of grammar would have to
represent constituency. This is one reason that finite-state models of grammar
are often inadequate. Now that we have explored some of the details of
the syntax of noun phrases, we are prepared to discuss another problem with
finite-state grammars. This problem is recursion. Recursion in a grammar
occurs when an expansion of a non-terminal includes the non-terminal itself,
as we saw in rules like Nominal → Nominal PP in the previous section.

In order to see why this is a problem for finite-state grammars, let’s
first attempt to build a finite-state model for some of the grammar rules we
have seen so far. For example, we could model the noun phrase up to the
head with a regular expression (= FSA) as follows:

(Det) (Card) (Ord) (Quant) (AP) Nominal

What  about  the  postmodifiers?  Let’s  just  try  adding  the  PP.  We  could 
then  augment  the  regular  expression  as  follows: 

(Det)  (Card)  (Ord)  (Quant)  (AP)  Nominal  (PP)* 

So  to  complete  this  regular  expression  we  just  need  to  expand  inline 
the  definition  of  PP,  as  follows: 

(Det)  (Card)  (Ord)  (Quant)  (AP)  Nominal  (P  NP)* 

But wait; our definition of NP now presupposes an NP! We would need
to expand the rule as follows:

(Det) (Card) (Ord) (Quant) (AP) Nominal (P (Det) (Card) (Ord) (Quant) (AP) Nominal (P NP))*

But of course the NP is back again! The problem is that NP is a
recursive rule. There is actually a sneaky way to ‘unwind’ this particular
right-recursive rule in a finite-state automaton. In general, however,
recursion cannot be handled in finite automata, and recursion is quite common
in a complete model of the NP (for example for RelClause and GerundVP,
which also have NP in their expansion):

(Det) (Card) (Ord) (Quant) (AP) Nominal (RelClause|GerundVP|PP)*
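One way to read the ‘unwinding’ of the right-recursive PP rule mentioned above is sketched below, using Python's re module over space-separated category tags. Because the embedded NP always sits at the right edge of the PP, any nesting of PPs produces the same tag string as a flat sequence of P plus pre-head material plus Nominal. The tag-string encoding and the names PRE, FLAT_NP and is_np are assumptions of this sketch.

```python
import re

# Optional pre-head material, each tag followed by a space when present.
PRE = r"(?:Det )?(?:Card )?(?:Ord )?(?:Quant )?(?:AP )?"

# NP with PP postmodifiers, with the recursion flattened into a Kleene star.
FLAT_NP = re.compile(PRE + r"Nominal(?: P " + PRE + r"Nominal)*")


def is_np(tags):
    """True if the tag sequence is accepted by the flattened NP pattern."""
    return FLAT_NP.fullmatch(" ".join(tags)) is not None


print(is_np(["Det", "Nominal", "P", "Nominal", "P", "Det", "AP", "Nominal"]))
```

The pattern accepts the same tag strings as the recursive rule but assigns them no nested structure, so it is at best weakly equivalent to the grammar. The same trick does not extend to center-embedded recursion.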

In particular, Chomsky (1959a) proved that a context-free language L
can be generated by a finite automaton if and only if there is a context-free
grammar that generates L that does not have any center-embedded recursions
(recursions of the form A →* α A β).

While  it  thus  seems  at  least  likely  that  we  can’t  model  all  of  English 
syntax with a finite state grammar, it is possible to build an FSA that
approximates English (for example by expanding only a certain number of NPs). In
fact  there  are  algorithms  for  automatically  generating  finite-state  grammars 
that  approximate  context-free  grammars  (Pereira  and  Wright,  1997). 
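The idea of approximating by bounded expansion can be illustrated on the simplest center-embedded grammar, S → a S b | a b. The sketch below expands the recursion a fixed number of times and stops, yielding an ordinary regular expression. The toy grammar and the function name are assumptions for the example; this is not the algorithm of Pereira and Wright.

```python
import re


def bounded_center_embedding(k):
    """Regex for S -> a S b | a b with the recursion expanded at most k deep.

    Matches exactly the strings a^n b^n for 1 <= n <= k.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    pattern = "ab"  # innermost expansion: S -> a b
    for _ in range(k - 1):
        pattern = "a(?:" + pattern + ")?b"  # wrap one more optional level
    return re.compile(pattern)


approx = bounded_center_embedding(3)
print(bool(approx.fullmatch("aaabbb")), bool(approx.fullmatch("aaaabbbb")))
```

The approximation is exact up to depth k and simply rejects anything deeper. The pattern also roughly doubles in nesting with each extra level, which hints at why such approximations are used only with small bounds.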

Chapter 10 will discuss an augmented version of the finite-state
automata called the recursive transition network or RTN that adds the
complete power of recursion to the FSA. The resulting machine is exactly
isomorphic to the context-free grammar, and can be a useful metaphor for
studying CFGs in certain circumstances.


9.12 Grammars & Human Processing

Do people use context-free grammars in their mental processing of
language? It has proved very difficult to find clear-cut evidence that they do.
For example, some early experiments asked subjects to judge which words
in a sentence were more closely connected (Levelt, 1970), finding that their
intuitive groupings corresponded to syntactic constituents. Other experimenters
examined the role of constituents in auditory comprehension by having
subjects listen to sentences while also listening to short “clicks” at different
times. Fodor and Bever (1965) found that subjects often mis-heard the
clicks as if they occurred at constituent boundaries. They argued that the
constituent was thus a ‘perceptual unit’ which resisted interruption.
Unfortunately there were severe methodological problems with the click paradigm
(see for example Clark and Clark (1977) for a discussion).

A broader problem with all these early studies is that they do not control
for the fact that constituents are often semantic units as well as syntactic
units.  Thus,  as  will  be  discussed  further  in  Chapter  15,  a single  odd  block  is  a 
constituent  (an  NP)  but  also  a semantic  unit  (an  object  of  type  BLOCK  which 
has  certain  properties).  Thus  experiments  which  show  that  people  notice  the 
boundaries  of  constituents  could  simply  be  measuring  a semantic  rather  than 
a syntactic  fact. 

Thus  it  is  necessary  to  find  evidence  for  a constituent  which  is  not 
a semantic unit. Furthermore, since there are many non-constituent-based
theories of grammar based on lexical dependencies, it is important to find
evidence that cannot be interpreted as a lexical fact; i.e. evidence for
constituency that is not based on particular words.

One suggestive series of experiments arguing for constituency has come
from Kathryn Bock and her colleagues. Bock and Loebell (1990), for example,
avoided all these earlier pitfalls by studying whether a subject who uses
a particular syntactic constituent (for example a verb-phrase of a particular
type, like V NP PP), is more likely to use the constituent in following
sentences. In other words, they asked whether use of a constituent primes its
use in subsequent sentences. As we saw in previous chapters, priming is a
common way to test for the existence of a mental structure. Bock and
Loebell relied on the English ditransitive alternation. A ditransitive verb is one
like give which can take two arguments:

(9.17) The wealthy widow gave [NP the church] [NP her Mercedes].

The verb give allows another possible subcategorization frame, called
a prepositional dative in which the indirect object is expressed as a
prepositional phrase:

(9.18) The wealthy widow gave [NP her Mercedes] [PP to the church].

As  we  discussed  on  page  339,  many  verbs  other  than  give  have  such 
alternations (send, sell, etc.; see Levin (1993) for a summary of many
different alternation patterns). Bock and Loebell relied on these alternations by
giving  subjects  a picture,  and  asking  them  to  describe  it  in  one  sentence.  The 
picture  was  designed  to  elicit  verbs  like  give  or  sell  by  showing  an  event  such 
as  a boy  handing  an  apple  to  a teacher.  Since  these  verbs  alternate,  subjects 
might,  for  example,  say  The  boy  gave  the  apple  to  the  teacher  or  The  boy 
gave  the  teacher  an  apple. 

Before  describing  the  picture,  subjects  were  asked  to  read  an  unrelated 
‘priming’  sentence  out  loud;  the  priming  sentences  either  had  V NP  NP  or 
V NP  PP  structure.  Crucially,  while  these  priming  sentences  had  the  same 
constituent  structure  as  the  dative  alternation  sentences,  they  did  not  have  the 
same  semantics.  For  example,  the  priming  sentences  might  be  prepositional 
locatives,  rather  than  datives: 

(9.19) IBM moved [NP a bigger computer] [PP to the Sears store].

Bock  and  Loebell  found  that  subjects  who  had  just  read  a V NP  PP 
sentence were more likely to use a V NP PP structure in describing the picture.
This  suggested  that  the  use  of  a particular  constituent  primed  the  later  use  of 
that  constituent,  and  hence  that  the  constituent  must  be  mentally  represented 
in  order  to  prime  and  be  primed. 

In  more  recent  work,  Bock  and  her  colleagues  have  continued  to  find 
evidence  for  this  kind  of  constituency  structure. 

There  is  a quite  different  disagreement  about  the  human  use  of  context- 
free  grammars.  Many  researchers  have  suggested  that  natural  language  is 
unlike  a formal  language,  and  in  particular  that  the  set  of  possible  sentences 
in  a language  cannot  be  described  by  purely  syntactic  context-free  grammar 
productions.  They  argue  that  a complete  model  of  syntactic  structure  will 
prove  to  be  impossible  unless  it  includes  knowledge  from  other  domains 
(for example semantic, intonational, pragmatic, and social/interactional
domains). Others argue that the syntax of natural language can be represented
by formal languages. This second position is called modularist: researchers
holding this position argue that human syntactic knowledge is a
distinct module of the human mind. The first position, in which grammatical
knowledge may incorporate semantic, pragmatic, and other constraints, is
called anti-modularist. We will return to this debate in Chapter 15.

9.13 Summary

This chapter has introduced a number of fundamental concepts in syntax via
the context-free grammar.

• In many languages, groups of consecutive words act as a group or a
constituent, which can be modeled by context-free grammars (also
known as phrase-structure grammars).

• A context-free grammar consists of a set of rules or productions,
expressed over a set of non-terminal symbols and a set of terminal
symbols. Formally, a particular context-free language is the set of strings
which can be derived from a particular context-free grammar.

• A generative  grammar  is  a traditional  name  in  linguistics  for  a formal 
language  which  is  used  to  model  the  grammar  of  a natural  language. 

• There  are  many  sentence-level  grammatical  constructions  in  English; 
declarative,  imperative,  yes-no-question,  and  wh-question  are  four 
very  common  types,  which  can  be  modeled  with  context-free  rules. 

• An English noun phrase can have determiners, numbers, quantifiers,
and adjective phrases preceding the head noun, which can be
followed by a number of postmodifiers: gerundive VPs, infinitive
VPs, and past participial VPs are common possibilities.

• Subjects  in  English  agree  with  the  main  verb  in  person  and  number. 

• Verbs can be subcategorized by the types of complements they
expect. Simple subcategories are transitive and intransitive; most
grammars include many more categories than these.

• The correlates of sentences in spoken language are generally called
utterances. Utterances may be disfluent, containing filled pauses like
um and uh, restarts, and repairs.

• Any  context-free  grammar  can  be  converted  to  Chomsky  normal  form, 
in  which  the  right-hand-side  of  each  rule  has  either  two  non-terminals 
or  a single  terminal. 

• Context-free  grammars  are  more  powerful  than  finite-state  automata, 
but  it  is  nonetheless  possible  to  approximate  a context-free  grammar 
with an FSA.

• There is some evidence that constituency plays a role in the human
processing of language.

Bibliographical and Historical Notes

“den sprachlichen Ausdruck für die willkürliche Gliederung einer
Gesammtvorstellung in ihre in logische Beziehung zueinander gesetzten
Bestandteile”

“the  linguistic  expression  for  the  arbitrary  division  of  a total  idea  into 
its  constituent  parts  placed  in  logical  relations  to  one  another” 

Wundt’s  (1900:240)  definition  of  the  sentence;  the  origin  of 
the  idea  of  phrasal  constituency,  cited  in  Percival  (1976). 


The recent historical research of Percival (1976) has made it clear
that this idea of breaking up a sentence into a hierarchy of constituents
appeared in the Völkerpsychologie of the groundbreaking psychologist
Wilhelm Wundt (Wundt, 1900). By contrast, traditional European grammar,
dating from the Classical period, defined relations between words rather than
constituents. Wundt's idea of constituency was taken up into linguistics by
Leonard Bloomfield in his early book An Introduction to the Study of
Language (Bloomfield, 1914). By the time of his later book Language
(Bloomfield, 1933), what was then called ‘immediate-constituent analysis’ was a
well-established method of syntactic study in the United States. By contrast,
European syntacticians retained an emphasis on word-based or dependency
grammars; Chapter 12 discusses some of these issues in introducing
dependency grammar.

American Structuralism saw a number of specific definitions of the
immediate constituent, couched in terms of their search for a ‘discovery
procedure’: a methodological algorithm for describing the syntax of a language.
In general, these attempt to capture the intuition that “The primary criterion
of the immediate constituent is the degree in which combinations behave as
simple units” (Bazell, 1952, p. 284). The most well-known of the specific
definitions is Harris’ idea of distributional similarity to individual units, with
the substitutability test. Essentially, the method proceeded by breaking up
a construction into constituents by attempting to substitute simple structures
for possible constituents: if a simple form, say man, was
substitutable in a construction for a more complex set (like intense young
man), then the form intense young man was probably a constituent. Harris’s
test was the beginning of the intuition that a constituent is a kind of
equivalence class.

The first formalization of this idea of hierarchical constituency was
the phrase-structure grammar defined in Chomsky (1956), and further
expanded upon (and argued against) in Chomsky (1957) and Chomsky (1975).
From this time on, most generative linguistic theories were based at least
in part on context-free grammars (such as Head-Driven Phrase Structure
Grammar (Pollard and Sag, 1994), Lexical-Functional Grammar (Bresnan,
1982), Government and Binding (Chomsky, 1981), and Construction
Grammar (Kay and Fillmore, 1999), inter alia); many of these theories used
schematic context-free templates known as X-bar schemata.

Shortly after Chomsky’s initial work, the context-free grammar was
rediscovered by Backus (1959) and independently by Naur et al. (1960) in their
descriptions of the ALGOL programming language; Backus (1996) noted
that he was influenced by the productions of Emil Post and that Naur’s work
was independent of his (Backus’) own. After this early work, a great
number of computational models of natural language processing were based on
context-free grammars because of the early development of efficient
algorithms to parse these grammars (see Chapter 10).

As  we  have  already  noted,  grammars  based  on  context-free  rules  are 
not  ubiquitous.  One  extended  formalism  is  Tree  Adjoining  Grammar  (TAG) 
(Joshi,  1985).  The  primary  data  structure  in  Tree  Adjoining  Grammar  is  the 
tree,  rather  than  the  rule.  Trees  come  in  two  kinds;  initial  trees  and  auxiliary 
trees.  Initial  trees  might,  for  example,  represent  simple  sentential  structures, 
while  auxiliary  trees  are  used  to  add  recursion  into  a tree.  Trees  are  combined 
by  two  operations  called  substitution  and  adjunction.  See  Joshi  (1985)  for 
more  details.  An  extension  of  Tree  Adjoining  Grammar  called  Lexicalized 
Tree  Adjoining  Grammars  will  be  discussed  in  Chapter  12. 

Another  class  of  grammatical  theories  that  are  not  based  on  context- 
free  grammars  are  instead  based  on  the  relation  between  words  rather  than 
constituents.  Various  such  theories  have  come  to  be  known  as  dependency 
grammars;  representative  examples  include  the  dependency  grammar  of 
Mel’čuk (1979), the Word Grammar of Hudson (1984), or the Constraint
Grammar of Karlsson et al. (1995). Dependency-based grammars have
returned to popularity in modern statistical parsers, as the field has come to
understand  the  crucial  role  of  word-to-word  relations;  see  Chapter  12  for 
further  discussion. 

Readers interested in general reference grammars of English should
waste  no  time  in  getting  hold  of  Quirk  et  al.  (1985a).  Other  useful  treatments 
include  McCawley  (1998). 


There are many good introductory textbooks on syntax. Sag and Wasow
(1999) is an introduction to formal syntax, focusing on the use of
phrase-structure, unification, and the type-hierarchy in Head-Driven Phrase
Structure Grammar. Van Valin (1999) is an introduction from a less formal, more
functional perspective, focusing on cross-linguistic data and on the
functional motivation for syntactic structures.


Exercises 

9.1  Draw  tree  structures  for  the  following  ATIS  phrases: 

a.  Dallas 

b.  from  Denver 

c.  after  five  p.m. 

d.  arriving  in  Washington 

e.  early  flights 

f.  all  redeye  flights 

g.  on  Thursday 

h.  a one-way  fare 

i.  any  delays  in  Denver 

9.2  Draw  tree  structures  for  the  following  ATIS  sentences: 

a.  Does  American  airlines  have  a flight  between  five  a.m.  and  six  a.m. 

b.  I would  like  to  fly  on  American  airlines. 

c.  Please  repeat  that. 

d.  Does  American  487  have  a first  class  section? 

e.  I need  to  fly  between  Philadelphia  and  Atlanta. 

f.  What  is  the  fare  from  Atlanta  to  Denver? 

g.  Is  there  an  American  airlines  flight  from  Philadelphia  to  Dallas? 

9.3 Augment the grammar rules on page 337 to handle pronouns. Deal
properly with person and case.

9.4 Modify the noun phrase grammar of Sections 9.4-9.6 to correctly model
mass nouns and their agreement properties.

9.5  How  many  types  of  NPs  would  rule  (9.10)  on  page  332  expand  to  if  we 
didn’t  allow  parentheses  in  our  grammar  formalism? 

9.6 Assume a grammar that has many VP rules for different subcategorizations,
as expressed in Section 9.7, and differently subcategorized verb
rules like Verb-with-NP-complement. How would the rule for post-nominal
relative clauses (9.12) need to be modified if we wanted to deal properly with
examples like the earliest flight that you have? Recall that in such examples
the pronoun that is the object of the verb get. Your rules should allow this
noun phrase but should correctly rule out the ungrammatical S *I get.

9.7 Does your solution to the previous problem correctly model the NP the
earliest flight that I can get? How about the earliest flight that I think my
mother wants me to book for her? Hint: this phenomenon is called
long-distance dependency.

9.8  Write  rules  expressing  the  verbal  subcategory  of  English  auxiliaries; 
for example you might have a rule can → verb-with-bare-stem-VP-complement.

9.9 NPs like Fortune’s office or my uncle’s marks are called possessive or
genitive noun phrases. A possessive noun phrase can be modeled by treating
the sub-NP like Fortune’s or my uncle’s as a determiner of the following head
noun. Write grammar rules for English possessives. You may treat ’s as if it
were a separate word (i.e. as if there were always a space before ’s).

9.10 Page 330 discussed the need for a Wh-NP constituent. The simplest
Wh-NP is one of the wh-pronouns (who, whom, whose, which). The wh-words
what and which can be determiners: which four will you have?, what
credit do you have with the Duke?. Write rules for the different types of
Wh-NPs.

9.11  Write  an  algorithm  for  converting  an  arbitrary  context-free  grammar 
into  Chomsky  normal  form. 


PARSING WITH CONTEXT-FREE GRAMMARS


There are and can exist but two ways of investigating and discovering
truth. The one hurries on rapidly from the senses and
particulars to the most general axioms, and from them ... derives
and discovers the intermediate axioms. The other constructs its
axioms from the senses and particulars, by ascending continually
and gradually, till it finally arrives at the most general axioms.

Francis  Bacon,  Novum  Organum  Book  1.19  (1620) 


By the 17th century, the western philosophical tradition had begun
to distinguish two important insights about human use and acquisition of
knowledge. The empiricist tradition, championed especially in Britain by
Bacon and Locke, focused on the way that knowledge is induced and reasoning
proceeds based on data and experience from the external world. The
rationalist tradition, championed especially on the Continent by Descartes
but following a tradition dating back to Plato’s Meno, focused on the way
that learning and reasoning are guided by prior knowledge and innate ideas.

This dialectic continues today, and has played an important role in
characterizing algorithms for parsing. We defined parsing in Chapter 3 as a
combination of recognizing an input string and assigning some structure to
it. Syntactic parsing, then, is the task of recognizing a sentence and assigning
a syntactic structure to it. This chapter focuses on the kind of structures
assigned by the context-free grammars of Chapter 9. Since context-free
grammars are a declarative formalism, they don't specify how the parse tree for
a given sentence should be computed. This chapter will, therefore, present
some of the many possible algorithms for automatically assigning a
context-free (phrase structure) tree to an input sentence.

Parse trees are directly useful in applications such as grammar checking
in word-processing systems; a sentence which cannot be parsed may
have grammatical errors (or at least be hard to read). In addition, parsing is
an important intermediate stage of representation for semantic analysis (as
we will see in Chapter 15), and thus plays an important role in applications
like machine translation, question answering, and information extraction.
For example, in order to answer the question


What  books  were  written  by  British  women  authors  before  1800? 


we'll want to know that the subject of the sentence was what books and that
the by-adjunct was British women authors to help us figure out that the user
wants a list of books (and not just a list of authors). Syntactic parsers are also
used in lexicography applications for building on-line versions of dictionaries.
Finally, stochastic versions of parsing algorithms have recently begun to
be incorporated into speech recognizers, both for language models (Ney,
1991) and for non-finite-state acoustic and phonotactic modeling (Lari and
Young, 1991).

The main parsing algorithm presented in this chapter is the Earley
algorithm (Earley, 1970), one of the context-free parsing algorithms based on
dynamic programming. We have already seen a number of dynamic
programming algorithms: Minimum-Edit-Distance, Viterbi, Forward. The
Earley algorithm is one of three commonly-used dynamic programming parsers;
the others are the Cocke-Younger-Kasami (CYK) algorithm, which we will
present in Chapter 12, and the Graham-Harrison-Ruzzo (GHR) algorithm
(Graham et al., 1980). Before presenting the Earley algorithm, we begin by
motivating various basic parsing ideas which make up the algorithm. First,
we revisit the 'search metaphor' for parsing and recognition, which we
introduced for finite-state automata in Chapter 2, and talk about the top-down
and bottom-up search strategies. We then introduce a 'baseline' top-down
backtracking parsing algorithm, to introduce the idea of simple but efficient
parsing. While this parser is perspicuous and relatively efficient, it is unable
to deal efficiently with the important problem of ambiguity: a sentence or
words which can have more than one parse. The final section of the chapter
then shows how the Earley algorithm can use insights from the top-down
parser with bottom-up filtering to efficiently handle ambiguous inputs.

10.1  Parsing as Search

Chapters  2 and  3 showed  that  finding  the  right  path  through  a finite-state 
automaton,  or  finding  the  right  transduction  for  an  input,  can  be  viewed  as 
a search  problem.  For  FSAs,  for  example,  the  parser  is  searching  through 
the  space  of  all  possible  paths  through  the  automaton.  In  syntactic  parsing, 
the  parser  can  be  viewed  as  searching  through  the  space  of  all  possible  parse 
trees  to  find  the  correct  parse  tree  for  the  sentence.  Just  as  the  search  space  of 
possible  paths  was  defined  by  the  structure  of  the  FSA,  so  the  search  space 
of  possible  parse  trees  is  defined  by  the  grammar.  For  example,  consider  the 
following  ATIS  sentence: 

(10.1)  Book  that  flight. 

Using the miniature grammar and lexicon in Figure 10.2, which consists
of some of the CFG rules for English introduced in Chapter 9, the correct
parse tree that would be assigned to this example is shown in
Figure 10.1.


How  can  we  use  the  grammar  in  Figure  10.2  to  assign  the  parse  tree  in 
Figure  10.1  to  Example  (10.1)?  (In  this  case  there  is  only  one  parse  tree,  but 
it  is  possible  for  there  to  be  more  than  one.)  The  goal  of  a parsing  search  is  to 
find  all  trees  whose  root  is  the  start  symbol  S,  which  cover  exactly  the  words 
in  the  input.  Regardless  of  the  search  algorithm  we  choose,  there  are  clearly 
two kinds of constraints that should help guide the search. One kind of
constraint comes from the data, i.e. the input sentence itself. Whatever else
is true of the final parse tree, we know that there must be three leaves, and
they must be the words book, that, and flight. The second kind of constraint
comes from the grammar. We know that whatever else is true of the final
parse tree, it must have one root, which must be the start symbol S.

    S → NP VP                   Det → that | this | a
    S → Aux NP VP               Noun → book | flight | meal | money
    S → VP                      Verb → book | include | prefer
    NP → Det Nominal            Aux → does
    Nominal → Noun
    Nominal → Noun Nominal      Prep → from | to | on
    NP → Proper-Noun            Proper-Noun → Houston | TWA
    VP → Verb
    VP → Verb NP
    Nominal → Nominal PP

Figure 10.2  A miniature English grammar and lexicon.
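For readers who want to experiment, the grammar and lexicon of Figure 10.2 can be written down as plain data. The following Python sketch is one possible encoding; the names GRAMMAR, LEXICON and pos_tags are illustrative choices of ours and not part of the chapter's notation.

```python
# Phrase-structure rules of Figure 10.2. Each left-hand side maps to its
# right-hand sides, kept in the textual order of the figure.
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["Proper-Noun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
}

# Lexicon of Figure 10.2: each part of speech maps to its words.
LEXICON = {
    "Det": {"that", "this", "a"},
    "Noun": {"book", "flight", "meal", "money"},
    "Verb": {"book", "include", "prefer"},
    "Aux": {"does"},
    "Prep": {"from", "to", "on"},
    "Proper-Noun": {"houston", "twa"},   # stored lower-cased for lookup
}


def pos_tags(word):
    """Return every part of speech the lexicon lists for a word."""
    return {pos for pos, words in LEXICON.items() if word.lower() in words}


print(pos_tags("Book"))   # book is lexically ambiguous: Noun and Verb
```

Keeping the rules in textual order matters later in the chapter, since the depth-first parser tries rules in exactly that order.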

These two constraints, recalling the empiricist/rationalist debate
described at the beginning of this chapter, give rise to the two search strategies
underlying most parsers: top-down or goal-directed search and bottom-up
or data-directed search.

Top-Down  Parsing 

A top-down parser searches for a parse tree by trying to build from the root
node S down to the leaves. Let's consider the search space that a top-down
parser explores, assuming for the moment that it builds all possible trees in
parallel. The algorithm starts by assuming the input can be derived by the
designated start symbol S. The next step is to find the tops of all trees which
can start with S, by looking for all the grammar rules with S on the left-hand
side. In the grammar in Figure 10.2, there are three rules that expand S, so
the second ply, or level, of the search space in Figure 10.3 has three partial
trees.

We  next  expand  the  constituents  in  these  three  new  trees,  just  as  we 
originally  expanded  S.  The  first  tree  tells  us  to  expect  an  NP  followed  by  a 
VP,  the  second  expects  an  Aux  followed  by  an  NP  and  a VP,  and  the  third  a 
VP  by  itself.  To  fit  the  search  space  on  the  page,  we  have  shown  in  the  third 
ply  of  Figure  10.3  only  the  trees  resulting  from  the  expansion  of  the  left-most 
leaves of each tree. At each ply of the search space we use the right-hand
sides of the rules to provide new sets of expectations for the parser, which
are then used to recursively generate the rest of the trees. Trees are grown
downward until they eventually reach the part-of-speech categories at the
bottom of the tree. At this point, trees whose leaves fail to match all the
words in the input can be rejected, leaving behind those trees that represent
successful parses.

In Figure 10.3, only the 5th parse tree (the one which has expanded
the rule VP → Verb NP) will eventually match the input sentence Book that
flight. The reader should check this for themselves in Figure 10.1.
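The ply-by-ply expansion just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions of ours: each partial tree is represented by its frontier (the left-to-right list of its leaves), the left-most leaf that has grammar rules is the one expanded, and the rule Nominal → Nominal PP is left out because Figure 10.2 gives no PP rule. The names expand_ply, ply2 and ply3 are hypothetical.

```python
# Rules of Figure 10.2 (without Nominal -> Nominal PP), in textual order.
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["Proper-Noun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP": [["Verb"], ["Verb", "NP"]],
}


def expand_ply(frontiers):
    """Build the next ply: expand the left-most expandable leaf of each tree."""
    next_ply = []
    for frontier in frontiers:
        for i, symbol in enumerate(frontier):
            if symbol in GRAMMAR:                      # a non-terminal with rules
                for rhs in GRAMMAR[symbol]:
                    next_ply.append(frontier[:i] + rhs + frontier[i + 1:])
                break
        else:
            next_ply.append(frontier)                  # only parts of speech left
    return next_ply


ply2 = expand_ply([["S"]])      # the three ways of expanding S
ply3 = expand_ply(ply2)         # six partial trees, as in the third ply
for frontier in ply3:
    print(" ".join(frontier))
```

The order in which the six frontiers are printed follows the textual order of the rules and need not match the left-to-right layout of Figure 10.3.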

Bottom-Up  Parsing 

Bottom-up parsing is the earliest known parsing algorithm (it was first
suggested by Yngve (1955)), and is used in the shift-reduce parsers common
for computer languages (Aho and Ullman, 1972). In bottom-up parsing, the
parser starts with the words of the input, and tries to build trees from the
words up, again by applying rules from the grammar one at a time. The
parse is successful if the parser succeeds in building a tree rooted in the start
symbol S that covers all of the input. Figure 10.4 shows the bottom-up search
space, beginning with the sentence Book that flight. The parser begins by
looking up each word (book, that, and flight) in the lexicon and building
three partial trees with the part of speech for each word. But the word book
is ambiguous; it can be a noun or a verb. Thus the parser must consider
two possible sets of trees. The first two plies in Figure 10.4 show this initial
bifurcation of the search space.


Figure 10.4  An expanding bottom-up search space for the sentence Book
that flight. This figure does not show the final tier of the search with the correct
parse tree (see Figure 10.1). Make sure you understand how that final parse
tree follows from the search space in this figure.

Each of the trees in the second ply is then expanded. In the parse
on the left (the one in which book is incorrectly considered a noun), the
Nominal → Noun rule is applied to both of the Nouns (book and flight). This
same rule is also applied to the sole Noun (flight) on the right, producing the
trees on the third ply.

In general, the parser extends one ply to the next by looking for places
in the parse-in-progress where the right-hand side of some rule might fit.
This contrasts with the earlier top-down parser, which expanded trees by
applying rules when their left-hand side matched an unexpanded nonterminal.
Thus in the fourth ply, in the first and third parse, the sequence Det Nominal
is recognized as the right-hand side of the NP → Det Nominal rule.

In  the  fifth  ply,  the  interpretation  of  book  as  a noun  has  been  pruned 
from  the  search  space.  This  is  because  this  parse  cannot  be  continued:  there 
is  no  rule  in  the  grammar  with  the  right-hand  side  Nominal  NP. 

The  final  ply  of  the  search  space  (not  shown  in  Figure  10.4)  is  the 
correct  parse  tree  (see  Figure  10.1).  Make  sure  you  understand  which  of  the 
two  parses  on  the  penultimate  ply  gave  rise  to  this  parse. 
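The bottom-up search can likewise be sketched as a small Python recognizer. It is a simplification of ours rather than the procedure behind Figure 10.4: instead of trees it keeps only the sequence of category labels along the top of each parse-in-progress, and it repeatedly replaces any stretch that matches the right-hand side of a rule by that rule's left-hand side. The names RULES, LEXICON and bottom_up_recognize are hypothetical, and only the words needed for the example are listed.

```python
from itertools import product

# Rules of Figure 10.2 (without Nominal -> Nominal PP) as (lhs, rhs) pairs.
RULES = [
    ("S", ("NP", "VP")), ("S", ("Aux", "NP", "VP")), ("S", ("VP",)),
    ("NP", ("Det", "Nominal")), ("NP", ("Proper-Noun",)),
    ("Nominal", ("Noun",)), ("Nominal", ("Noun", "Nominal")),
    ("VP", ("Verb",)), ("VP", ("Verb", "NP")),
]
LEXICON = {"book": ["Noun", "Verb"], "that": ["Det"], "flight": ["Noun"]}


def bottom_up_recognize(words):
    """Return True if some series of reductions turns the words into S."""
    # First ply: one state per combination of parts of speech.
    tag_choices = [LEXICON[w.lower()] for w in words]
    agenda = [tuple(tags) for tags in product(*tag_choices)]
    seen = set(agenda)
    while agenda:
        state = agenda.pop()
        if state == ("S",):
            return True
        for lhs, rhs in RULES:
            n = len(rhs)
            for i in range(len(state) - n + 1):
                if state[i:i + n] == rhs:              # right-hand side fits here
                    new_state = state[:i] + (lhs,) + state[i + n:]
                    if new_state not in seen:
                        seen.add(new_state)
                        agenda.append(new_state)
    return False


print(bottom_up_recognize(["Book", "that", "flight"]))
```

Note how the state that starts with Noun for book is explored even though it can never be reduced to S, which is the wasted effort discussed in the comparison that follows.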

Comparing  Top-down  and  Bottom-up  Parsing 

Each of these two architectures has its own advantages and disadvantages.
The top-down strategy never wastes time exploring trees that cannot result
in an S, since it begins by generating just those trees. This means it also
never explores subtrees that cannot find a place in some S-rooted tree. In the
bottom-up strategy, by contrast, trees that have no hope of leading to an S,
or fitting in with any of their neighbors, are generated with wild abandon.
For example, the left branch of the search space in Figure 10.4 is completely
wasted effort; it is based on interpreting book as a Noun at the beginning of
the sentence despite the fact that no such tree can lead to an S given this grammar.

The top-down approach has its own inefficiencies. While it does not
waste time with trees that do not lead to an S, it does spend considerable
effort on S trees that are not consistent with the input. Note that the first
four of the six trees in the third ply in Figure 10.3 all have left branches that
cannot match the word book. None of these trees could possibly be used
in parsing this sentence. This weakness in top-down parsers arises from the
fact that they can generate trees before ever examining the input. Bottom-up
parsers, on the other hand, never suggest trees that are not at least locally
grounded in the actual input.

Neither of these approaches adequately exploits the constraints
presented by the grammar and the input words. In the next section, we present
a baseline parsing algorithm that incorporates features of both the top-down
and bottom-up approaches. This parser is not as efficient as the Earley or
CYK parsers we will introduce later, but it is useful for showing the basic
operations of parsing.


10.2  A Basic  Top-down  Parser 

There are any number of ways of combining the best features of top-down
and  bottom-up  parsing  into  a single  algorithm.  One  fairly  straightforward 
approach  is  to  adopt  one  technique  as  the  primary  control  strategy  used  to 
generate  trees  and  then  use  constraints  from  the  other  technique  to  filter  out 
inappropriate  parses  on  the  fly.  The  parser  we  develop  in  this  section  uses  a 
top-down  control  strategy  augmented  with  a bottom-up  filtering  mechanism. 
Our  first  step  will  be  to  develop  a concrete  implementation  of  the  top-down 
strategy  described  in  the  last  section.  The  ability  to  filter  bad  parses  based  on 
bottom-up  constraints  from  the  input  will  then  be  grafted  onto  this  top-down 
parser. 

In our discussions of both top-down and bottom-up parsing, we
assumed that we would explore all possible parse trees in parallel. Thus each
ply of the search in Figure 10.3 and Figure 10.4 showed all possible
expansions of the parse trees on the previous plies. Although it is certainly possible
to implement this method directly, it typically entails the use of an unrealistic
amount of memory to store the space of trees as they are being constructed.
This is especially true since realistic grammars have much more ambiguity
than the miniature grammar in Figure 10.2.

A more reasonable approach is to use a depth-first strategy such as
the one used to implement the various finite state machines in Chapter 2 and
Chapter 3. The depth-first approach expands the search space incrementally
by systematically exploring one state at a time. The state chosen for
expansion is the most recently generated one. When this strategy arrives at a tree
that is inconsistent with the input, the search continues by returning to the
most recently generated, as yet unexplored, tree. The net effect of this
strategy is a parser that single-mindedly pursues trees until they either succeed or
fail before returning to work on trees generated earlier in the process. Figure
10.5 illustrates such a top-down, depth-first derivation using Grammar 10.2.

Note that this derivation is not fully determined by the specification of a
top-down, depth-first strategy. There are two kinds of choices that have been
left unspecified that can lead to different derivations: the choice of which
leaf node of a tree to expand and the order in which applicable grammar
rules are applied. In this derivation, the left-most unexpanded leaf node of
the current tree is being expanded first, and the applicable rules of the
grammar are being applied according to their textual order in the grammar. The
decision to expand the left-most unexpanded node in the tree is important
since it determines the order in which the input words will be consulted as
the tree is constructed. Specifically, it results in a relatively natural forward
incorporation of the input words into a tree. The second choice of applying
rules in their textual order has consequences that will be discussed later.

Figure 10.6 presents a parsing algorithm that instantiates this top-down,
depth-first, left-to-right strategy. This algorithm maintains an agenda of
search-states. Each search-state consists of partial trees together with a
pointer to the next input word in the sentence.

function TOP-DOWN-PARSE(input, grammar) returns a parse tree

  agenda ← (Initial S tree, Beginning of input)
  current-search-state ← POP(agenda)
  loop
    if SUCCESSFUL-PARSE?(current-search-state) then
      return TREE(current-search-state)
    else
      if CAT(NODE-TO-EXPAND(current-search-state)) is a POS then
        if CAT(node-to-expand) ⊂ POS(CURRENT-INPUT(current-search-state)) then
          PUSH(APPLY-LEXICAL-RULE(current-search-state), agenda)
        else
          return reject
      else
        PUSH(APPLY-RULES(current-search-state, grammar), agenda)
    if agenda is empty then
      return reject
    else
      current-search-state ← NEXT(agenda)
  end

Figure 10.6  A top-down, depth-first left-to-right parser.

The  main  loop  of  the  parser  takes  a state  from  the  front  of  the  agenda 
and  produces  a new  set  of  states  by  applying  all  the  applicable  grammar  rules 
to  the  left-most  unexpanded  node  of  the  tree  associated  with  that  state.  This 
set  of  new  states  is  then  added  to  the  front  of  the  agenda  in  accordance  with 
the  textual  order  of  the  grammar  rules  that  were  used  to  generate  them.  This 
process  continues  until  either  a successful  parse  tree  is  found  or  the  agenda 
is exhausted, indicating that the input cannot be parsed.
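The agenda-based loop described above can be sketched in Python as follows. This is a minimal illustration under assumptions of ours, not a transcription of Figure 10.6: a search state is a triple of the symbols still to be expanded, the index of the next input word, and the rules applied so far; a state whose part of speech does not match the current word is simply dropped so that the search backs up to the next state on the agenda; and the left-recursive rule Nominal → Nominal PP is omitted, since Figure 10.2 has no PP rule and a depth-first search could expand it forever. The function returns the left-most derivation rather than a tree, and all names are hypothetical.

```python
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["Proper-Noun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "VP": [["Verb"], ["Verb", "NP"]],
}
LEXICON = {
    "Det": {"that", "this", "a"},
    "Noun": {"book", "flight", "meal", "money"},
    "Verb": {"book", "include", "prefer"},
    "Aux": {"does"},
    "Proper-Noun": {"houston", "twa"},
}


def top_down_parse(words):
    """Return the list of rules in a successful derivation, or None."""
    words = [w.lower() for w in words]
    agenda = [(("S",), 0, ())]          # initial S tree, beginning of input
    while agenda:
        to_expand, pos, derivation = agenda.pop(0)     # front of the agenda
        if not to_expand:
            if pos == len(words):
                return list(derivation)                # every word is covered
            continue                                   # tree done, input left over
        node, rest = to_expand[0], to_expand[1:]
        if node in LEXICON:                            # node is a part of speech
            if pos < len(words) and words[pos] in LEXICON[node]:
                step = (node, (words[pos],))
                new_states = [(rest, pos + 1, derivation + (step,))]
            else:
                new_states = []                        # dead end: back up
        else:
            new_states = [
                (tuple(rhs) + rest, pos, derivation + ((node, tuple(rhs)),))
                for rhs in GRAMMAR[node]               # textual rule order
            ]
        agenda = new_states + agenda                   # add to the front
    return None


for lhs, rhs in top_down_parse("Does this flight include a meal".split()):
    print(lhs, "->", " ".join(rhs))
```

Adding new states to the front of the agenda is what makes the search depth-first; appending them to the back instead would give a breadth-first search over the same states.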

Figure  10.7  shows  the  sequence  of  states  examined  by  this  algorithm 
in  the  course  of  parsing  the  following  sentence. 

(10.2)  Does  this  flight  include  a meal? 

In this figure, the node currently being expanded is shown in a box, while
the current input word is bracketed. Words to the left of the bracketed word
have already been incorporated into the tree.

Figure 10.7  A top-down, depth-first, left-to-right derivation.

The parser begins with a fruitless exploration of the S → NP VP rule,
which ultimately fails because the word Does cannot be derived from any
of the parts-of-speech that can begin an NP. The parser thus eliminates the
S → NP VP rule. The next search-state on the agenda corresponds to the
S → Aux NP VP rule. Once this state is found, the search continues in a
straightforward depth-first, left-to-right fashion through the rest of the
derivation.

Adding Bottom-up Filtering

Figure 10.7 shows an important qualitative aspect of the top-down parser.
Beginning at the root of the parse tree, the parser expands non-terminal
symbols along the left edge of the tree, down to the word at the bottom left
edge of the tree. As soon as a word is incorporated into a tree, the input
pointer moves on, and the parser will expand the new next left-most open
non-terminal symbol down to the new left corner word.

Thus  in  any  successful  parse  the  current  input  word  must  serve  as  the 
first  word  in  the  derivation  of  the  unexpanded  node  that  the  parser  is  currently 
processing.  This  leads  to  an  important  consequence  which  will  be  useful  in 
adding  bottom-up  filtering.  The  parser  should  not  consider  any  grammar 
rule  if  the  current  input  cannot  serve  as  the  first  word  along  the  left  edge  of 
some  derivation  from  this  rule.  We  call  the  first  word  along  the  left  edge  of 
a derivation the left-corner of the tree.


Figure 10.9  An illustration of the left-corner notion. The node Verb and
the node prefer are both left-corners of VP.

Consider  the  parse  tree  for  a VP  shown  in  Figure  10.9.  If  we  visualize 
the  parse  tree  for  this  VP  as  a triangle  with  the  words  along  the  bottom, 
the  word  prefer  lies  at  the  lower  left-corner  of  the  tree.  Formally,  we  can 
say  that  for  non-terminals  A and  B,  B is  a left-corner  of  A if  the  following 
relation  holds: 

    $A \Rightarrow^{*} B\alpha$

In  other  words,  B can  be  a left-corner  of  A if  there  is  a derivation  of  A that 
begins  with  a B. 

We return to our example sentence Does this flight include a meal?
The grammar in Figure 10.2 provides us with three rules that can be used to
expand the category S:

    S → NP VP
    S → Aux NP VP
    S → VP

Using the left-corner notion, it is easy to see that only the S → Aux NP VP
rule is a viable candidate since the word Does cannot serve as the left-corner
of either the NP or the VP required by the other two S rules. Knowing this,
the parser should concentrate on the Aux NP VP rule, without first
constructing and backtracking out of the others, as it did with the non-filtering
example shown in Figure 10.7.

The  information  needed  to  efficiently  implement  such  a filter  can  be 
compiled  in  the  form  of  a table  that  lists  all  the  valid  left-corner  categories 
for  each  non-terminal  in  the  grammar.  When  a rule  is  considered,  the  table 
entry  for  the  category  that  starts  the  right  hand  side  of  the  rule  is  consulted.  If 
it  fails  to  contain  any  of  the  parts-of-speech  associated  with  the  current  input 
then  the  rule  is  eliminated  from  consideration.  The  following  table  shows 
the  left-corner  table  for  Grammar  10.2. 


    Category    Left Corners
    S           Det, Proper-Noun, Aux, Verb
    NP          Det, Proper-Noun
    Nominal     Noun
    VP          Verb

Using  this  left-corner  table  as  a filter  in  the  parsing  algorithm  of  Figure  10.6 
is  left  as  Exercise  10.1  for  the  reader. 
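As a starting point for that exercise, the table itself can be computed mechanically from the grammar. The Python sketch below is one possible way to do so, assuming (as the table above does) that only part-of-speech categories are recorded as left corners; the names left_corner_table and rule_is_viable are hypothetical. It stops short of wiring the check into the parser.

```python
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Nominal"], ["Proper-Noun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"], ["Nominal", "PP"]],
    "VP": [["Verb"], ["Verb", "NP"]],
}
PARTS_OF_SPEECH = {"Det", "Noun", "Verb", "Aux", "Prep", "Proper-Noun"}


def left_corner_table(grammar, parts_of_speech):
    """Map each non-terminal to the parts of speech that can begin it."""
    table = {}
    for category in grammar:
        seen, stack = set(), [category]
        while stack:                         # follow first symbols transitively
            symbol = stack.pop()
            for rhs in grammar.get(symbol, []):
                first = rhs[0]
                if first not in seen:
                    seen.add(first)
                    stack.append(first)
        table[category] = seen & parts_of_speech
    return table


def rule_is_viable(rhs, word_tags, table, parts_of_speech):
    """Check whether the current word could start a derivation of this rule."""
    first = rhs[0]
    if first in parts_of_speech:
        return first in word_tags
    return bool(table.get(first, set()) & word_tags)


TABLE = left_corner_table(GRAMMAR, PARTS_OF_SPEECH)
for category, corners in TABLE.items():
    print(category, sorted(corners))

# With Does (an Aux) as the current word, only S -> Aux NP VP survives.
print([rule_is_viable(rhs, {"Aux"}, TABLE, PARTS_OF_SPEECH) for rhs in GRAMMAR["S"]])
```

Because the table depends only on the grammar, it can be built once before parsing and consulted in constant time for each rule.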


10.3  Problems  with  the  Basic  Top-down  Parser 


Even augmented with bottom-up filtering, the top-down parser in Figure 10.7
has three problems that make it an insufficient solution to the general-purpose
parsing problem. These three problems are left-recursion, ambiguity, and
inefficient reparsing of subtrees. After exploring the nature of these three
problems, we will introduce the Earley algorithm which is able to avoid all
of them.

Left-Recursion

Depth-first search has a well-known flaw when exploring an infinite search
space: it may dive down an infinitely deep path and never return to visit
the unexpanded states. This problem manifests itself in top-down,
depth-first, left-to-right parsers when left-recursive grammars are used. Formally,
a grammar is left-recursive if it contains at least one non-terminal A, such
that $A \Rightarrow^{*} \alpha A \beta$, for some $\alpha$ and $\beta$ and $\alpha \Rightarrow^{*} \epsilon$. In other words, a grammar
is left-recursive if it contains a non-terminal category that has a derivation
that includes itself anywhere along its leftmost branch. The grammar of
Chapter 9 had just such a left-recursive example, in the rules for possessive
NPs like Atlanta's airport:

    NP → Det Nominal
    Det → NP 's

These rules introduce left-recursion into the grammar since there is a
derivation for the first element of the NP, the Det, that has an NP as its first
constituent.
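Whether a grammar is left-recursive in this sense can be checked mechanically. The following Python sketch does so under a simplifying assumption of ours: the grammar has no rules that derive the empty string, so only the first symbol of each right-hand side can lie on the leftmost branch. The function name and the small possessive grammar are illustrative only.

```python
def left_recursive_nonterminals(grammar):
    """Return the non-terminals that can reappear on their own leftmost branch."""
    result = set()
    for start in grammar:
        seen, stack = set(), [start]
        while stack:                         # follow first symbols transitively
            symbol = stack.pop()
            for rhs in grammar.get(symbol, []):
                if rhs and rhs[0] not in seen:
                    seen.add(rhs[0])
                    stack.append(rhs[0])
        if start in seen:                    # start is reachable from itself
            result.add(start)
    return result


# The possessive rules discussed above, plus two non-recursive rules.
POSSESSIVE_GRAMMAR = {
    "NP": [["Det", "Nominal"]],
    "Det": [["NP", "'s"], ["that"]],
    "Nominal": [["Noun"]],
}

print(sorted(left_recursive_nonterminals(POSSESSIVE_GRAMMAR)))
```

Here the recursion is indirect: neither rule mentions its own left-hand side first, yet NP reaches Det and Det reaches NP, so both are reported.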

A more obvious and common case of left-recursion in natural language
grammars involves immediately left-recursive rules. These are rules of the
form $A \rightarrow A\beta$, where the first constituent of the right-hand side is
identical to the left-hand side. The following are some of the immediately
left-recursive rules that make frequent appearances in grammars of English.

    NP → NP PP
    VP → VP PP
    S → S and S

A left-recursive non-terminal can lead a top-down, depth-first,
left-to-right parser to recursively expand the same non-terminal over again in
exactly the same way, leading to an infinite expansion of trees.

Figure 10.10 shows the kind of expansion that accompanies the
addition of the NP → NP PP rule as the first NP rule in our small grammar.

There are two reasonable methods for dealing with left-recursion in a
backtracking top-down parser: rewriting the grammar, and explicitly
managing the depth of the search during parsing. Recall from Chapter 9 that
it is often possible to rewrite the rules of a grammar into a weakly
equivalent new grammar that still accepts exactly the same language as the
original grammar. It is possible to eliminate left-recursion from certain common
classes of grammars by rewriting a left-recursive grammar into a weakly
equivalent non-left-recursive one. The intuition is to rewrite each rule of the
form $A \rightarrow A\beta$ according to the following schema, using a new symbol $A'$:

$$
A \rightarrow A\beta \mid \alpha
\quad\Longrightarrow\quad
\begin{array}{l}
A \rightarrow \alpha A' \\
A' \rightarrow \beta A' \mid \epsilon
\end{array}
$$


This transformation changes the left-recursion to a right-recursion, and changes
the trees that result from these rules from left-branching structures to
right-branching ones. Unfortunately, rewriting grammars in this way has a major
disadvantage: a rewritten phrase-structure rule may no longer be the most
grammatically natural way to represent a particular syntactic structure.
Furthermore, as we will see in Chapter 15, this rewriting may make semantic
interpretation quite difficult.
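The rewriting schema for immediately left-recursive rules is simple enough to sketch in Python. The function below is an illustration under assumptions of ours: it handles only immediate left-recursion for a single non-terminal, it names the new symbol by appending a prime to the old one, and it writes the empty right-hand side as an empty list. The name remove_immediate_left_recursion is hypothetical.

```python
def remove_immediate_left_recursion(lhs, alternatives):
    """Rewrite A -> A beta | alpha as A -> alpha A', A' -> beta A' | epsilon."""
    # The betas: what follows A in each immediately left-recursive rule.
    betas = [rhs[1:] for rhs in alternatives if rhs and rhs[0] == lhs]
    # The alphas: every right-hand side that does not begin with A.
    alphas = [rhs for rhs in alternatives if not rhs or rhs[0] != lhs]
    if not betas:
        return {lhs: alternatives}           # nothing to rewrite
    new_symbol = lhs + "'"
    return {
        lhs: [alpha + [new_symbol] for alpha in alphas],
        new_symbol: [beta + [new_symbol] for beta in betas] + [[]],  # [] is epsilon
    }


rewritten = remove_immediate_left_recursion(
    "NP", [["NP", "PP"], ["Det", "Nominal"]]
)
for lhs, alternatives in rewritten.items():
    for rhs in alternatives:
        print(lhs, "->", " ".join(rhs) if rhs else "epsilon")
```

The rewritten rules accept the same strings, but an NP with several PPs now receives a right-branching tree, which is exactly the loss of naturalness noted above.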


Ambiguity 

One morning I shot an elephant in my pajamas. How he got into
my pajamas I don't know.

Groucho Marx, Animal Crackers, 1930

The second problem with the top-down parser of Figure 10.6 is that it is
not efficient at handling ambiguity. Chapter 8 introduced the idea of lexical
category ambiguity (words which may have more than one part of speech)
and disambiguation (choosing the correct part of speech for a word).

In this section we introduce a new kind of ambiguity, which arises in
the syntactic structures used in parsing, called structural ambiguity.
Structural ambiguity occurs when the grammar assigns more than one possible
parse to a sentence. Groucho Marx's well-known line as Captain Spaulding
is ambiguous because the phrase in my pajamas can be part
of the NP headed by elephant or the verb phrase headed by shot.




Figure 10.11  Two parse trees for an ambiguous sentence. Parse (a)
corresponds to the humorous reading in which the elephant is in the pajamas, parse
(b) to the reading in which Captain Spaulding did the shooting in his pajamas.
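To see the two readings fall out of a grammar, one can count the parse trees that a small grammar assigns to the sentence. The Python sketch below is not the top-down parser of this chapter; it is a simple span-based counter that we use only for illustration. The toy grammar, the decision to list the pronoun I directly as an NP, and the names BINARY_RULES and count_parses are all assumptions of ours.

```python
from functools import lru_cache

# A hypothetical toy grammar with two-symbol right-hand sides only.
BINARY_RULES = {
    "S": [("NP", "VP")],
    "VP": [("Verb", "NP"), ("VP", "PP")],
    "NP": [("Det", "Noun"), ("NP", "PP")],
    "PP": [("Prep", "NP")],
}
LEXICON = {
    "i": {"NP"},                 # the pronoun is treated directly as an NP
    "shot": {"Verb"},
    "an": {"Det"}, "my": {"Det"},
    "elephant": {"Noun"}, "pajamas": {"Noun"},
    "in": {"Prep"},
}


def count_parses(words, root="S"):
    """Count the distinct trees rooted in `root` that cover all the words."""
    words = tuple(w.lower() for w in words)

    @lru_cache(maxsize=None)
    def count(i, j, category):
        if j - i == 1:                       # a single word
            return 1 if category in LEXICON.get(words[i], set()) else 0
        total = 0
        for left, right in BINARY_RULES.get(category, []):
            for k in range(i + 1, j):        # every split point of the span
                total += count(i, k, left) * count(k, j, right)
        return total

    return count(0, len(words), root)


print(count_parses("I shot an elephant in my pajamas".split()))
```

The two trees differ only in where the PP attaches: under NP → NP PP the elephant wears the pajamas, and under VP → VP PP the shooter does.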


Structural ambiguity, appropriately enough, comes in many forms. Three
particularly common kinds of ambiguity are attachment ambiguity,
coordination ambiguity, and noun-phrase bracketing ambiguity.

A sentence has an attachment ambiguity if a particular constituent can
be attached to the parse tree at more than one place. The Groucho Marx
sentence above is an example of PP-attachment ambiguity. Various kinds
of adverbial phrases are also subject to this kind of ambiguity. For example,
in the following sentence the gerundive-VP flying to New York can be part
of a gerundive sentence whose subject is the Grand Canyon or it can be an
adjunct modifying the VP headed by saw:

(10.3)  I saw  the  Grand  Canyon  flying  to  New  York. 

In a similar kind of ambiguity, the sentence “Can you book TWA
flights” is ambiguous between a reading meaning ‘Can you book flights
on behalf of TWA’, and the other meaning ‘Can you book flights run by
TWA’. Here either one NP is attached to another to form a complex NP
(TWA flights), or both NPs are distinct daughters of the verb phrase.
Figure 10.12 shows both parses.

Another common kind of ambiguity is coordination ambiguity, in
which there are different sets of phrases that can be conjoined by a
conjunction like and. For example, the phrase old men and women can be bracketed
[old [men and women]], referring to old men and old women, or as [old men]
and [women], in which case it is only the men who are old.

These ambiguities all combine in complex ways. A program that
summarized the news, for example, would need to be able to parse sentences like
the following from the Brown corpus:

(10.4)  President  Kennedy  today  pushed  aside  other  White  House  business 
to  devote  all  his  time  and  attention  to  working  on  the  Berlin  crisis 
address  he  will  deliver  tomorrow  night  to  the  American  people  over 
nationwide  television  and  radio. 

This sentence has a number of ambiguities, although since they are
semantically unreasonable, it requires a careful reading to see them.
The last noun phrase could be parsed [nationwide [television and
radio]] or [[nationwide television] and radio]. The direct object of
pushed aside should be other White House business but could also be the
bizarre phrase [other White House business to devote all his time and
attention to working] (i.e., a structure like Kennedy denied [his
intention to propose a new budget to address the deficit]). Then the
phrase on the Berlin crisis address he will deliver tomorrow night to
the American people could be an adjunct modifying the verb pushed. The
PP over nationwide television and radio could be attached to any of the
higher VPs or NPs (for example, it could modify people or night).

The fact that there are many unreasonable parses for a sentence is an
extremely irksome problem that affects all parsers. In practice,
parsing a sentence thus requires disambiguation: choosing the correct
parse from a multitude of possible parses. Disambiguation algorithms
generally require both statistical and semantic knowledge, so they will
be introduced later, in Chapter 12 and Chapter 17.

Parsers  which  do  not  incorporate  disambiguators  must  simply  return 
all  the  possible  parse  trees  for  a given  input.  Since  the  top-down  parser  of 
Figure  10.7  only  returns  the  first  parse  it  finds,  it  would  thus  need  to  be 
modified  to  return  all  the  possible  parses.  The  algorithm  would  be  changed 
to  collect  each  parse  as  it  is  found  and  continue  looking  for  more  parses. 
When  the  search  space  has  been  exhausted,  the  list  of  all  the  trees  found  is 
returned.  Subsequent  processing  or  a human  analyst  can  then  decide  which 
of  the  returned  parses  is  correct. 
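
As a rough illustration of collecting every parse instead of stopping
at the first one, the following Python sketch enumerates all parses
with a backtracking top-down search. It is not the parser of Figure
10.7: the toy grammar and lexicon (GRAMMAR, LEXICON) and the function
names are assumptions made for this example, and the rules are written
without left recursion so that the naive search terminates.

```python
# Toy grammar and lexicon, assumed for illustration only. The rules avoid
# left recursion so that naive top-down expansion cannot loop forever.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["Det", "Noun"], ["Det", "Noun", "PP"]],
    "VP": [["Verb", "NP"], ["Verb", "NP", "PP"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {
    "I": "Pronoun", "shot": "Verb", "an": "Det", "my": "Det",
    "elephant": "Noun", "pajamas": "Noun", "in": "Prep",
}


def expand(symbol, words, i):
    """Yield (tree, end) for every way `symbol` can span words[i:end]."""
    if symbol not in GRAMMAR:  # part-of-speech category: consult the input
        if i < len(words) and LEXICON.get(words[i]) == symbol:
            yield (symbol, words[i]), i + 1
        return
    for rhs in GRAMMAR[symbol]:  # try every expansion, never stop early
        for children, end in expand_sequence(rhs, words, i):
            yield (symbol, *children), end


def expand_sequence(symbols, words, i):
    """Yield (list_of_trees, end) for every way to match a symbol sequence."""
    if not symbols:
        yield [], i
        return
    for first, mid in expand(symbols[0], words, i):
        for rest, end in expand_sequence(symbols[1:], words, mid):
            yield [first] + rest, end


def all_parses(sentence):
    """Collect every S tree that covers the whole input."""
    words = sentence.split()
    return [tree for tree, end in expand("S", words, 0) if end == len(words)]


parses = all_parses("I shot an elephant in my pajamas")
for tree in parses:
    print(tree)
```

Under these assumed rules the search returns two trees for the Groucho
Marx sentence, one attaching the PP inside the object NP and one
attaching it to the VP.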

Unfortunately, we almost certainly do not want all possible parses from
the robust, highly ambiguous, wide-coverage grammars used in practical
applications. The reason for this lies in the potentially exponential
number of parses that are possible for certain inputs. Consider the
ATIS example (10.5):

(10.5)  Show  me  the  meal  on  Flight  UA  386  from  San  Francisco  to  Denver. 


When our extremely small grammar is augmented with the recursive
VP → VP PP and NP → NP PP rules introduced above, the three
prepositional phrases at the end of this sentence conspire to yield a
total of 14 parse trees for this sentence. For example, from San
Francisco could be part of the VP headed by show (which would have the
bizarre interpretation that the showing was happening from San
Francisco).

Church and Patil (1982) showed that the number of parses for sentences
of this type grows at the same rate as the number of parenthesizations
of arithmetic expressions. Such parenthesization problems, in turn, are
known to grow exponentially in accordance with what are called the
Catalan numbers:

C(n) = \frac{1}{n+1} \binom{2n}{n}

The following table shows the number of parses for a simple noun
phrase as a function of the number of trailing prepositional phrases.
As can be seen, this kind of ambiguity can very quickly make it
imprudent to keep every possible parse around.

Number of PPs    Number of NP Parses
      2                     2
      3                     5
      4                    14
      5                    42
      6                   132
      7                   429
      8                  1430
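
The growth rate can be checked directly. The short Python sketch below
evaluates the closed form for the Catalan numbers and cross-checks it
against a direct recursive count of binary bracketings; the function
names are ours, chosen only for this illustration.

```python
from math import comb


def catalan(n: int) -> int:
    """n-th Catalan number from the closed form binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def count_bracketings(n: int) -> int:
    """Count binary bracketings of n + 1 items by splitting at every point."""
    if n == 0:
        return 1
    return sum(count_bracketings(k) * count_bracketings(n - 1 - k)
               for k in range(n))


for n in range(2, 9):
    print(n, catalan(n))
```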

There are two basic ways out of this dilemma: using dynamic
programming to exploit regularities in the search space so that common
subparts are derived only once, thus reducing some of the costs
associated with ambiguity, and augmenting the parser’s search strategy
with heuristics that guide it toward likely parses first. The dynamic
programming approach will be explored in the next section, while the
heuristic search strategies will be covered in Chapter 12.

Even if a sentence isn’t ambiguous, it can be inefficient to parse due
to local ambiguity. Local ambiguity occurs when some part of a sentence
is ambiguous, i.e., has more than one parse, even if the whole sentence
is not ambiguous. For example, the sentence Book that flight is
unambiguous, but when the parser sees the first word Book, it cannot
know if it is a verb or a noun until later. Thus it must use
backtracking or parallelism to consider both possible parses.

Repeated Parsing of Subtrees

The  ambiguity  problem  is  related  to  another  inefficiency  of  the  top-down 
parser  of  Section  10.2.  The  parser  often  builds  valid  trees  for  portions  of 
the  input,  then  discards  them  during  backtracking,  only  to  find  that  it  has  to 
rebuild  them  again.  Consider  the  process  involved  in  finding  a parse  for  the 
NP  in  (10.6): 

(10.6)  a flight  from  Indianapolis  to  Houston  on  TWA 

The preferred parse, which is also the one found first by the parser
presented in Section 10.2, is shown as the bottom tree in Figure 10.14.
While there are 5 distinct parses of this phrase, we will focus here on
the ridiculous amount of repeated work involved in retrieving this
single parse.

Because of the way the rules are consulted in our top-down,
depth-first, left-to-right approach, the parser is led first to small
parse trees that fail because they do not cover all of the input. These
successive failures trigger backtracking events which lead to parses
that incrementally cover more and more of the input. The sequence of
trees attempted by our top-down parser is shown in Figure 10.14.

This figure clearly illustrates the kind of silly reduplication of work
that arises in backtracking approaches. Except for its topmost
component, every part of the final tree is derived more than once. The
following table shows the number of times that each of the major
constituents in the final tree is derived. The work done on this
example would, of course, be magnified by any backtracking caused by
the verb phrase or sentential level. Note that although this example is
specific to top-down parsing, similar examples of wasted effort exist
for bottom-up parsing as well.


a flight                                        4
from Indianapolis                               3
to Houston                                      2
on TWA                                          1
a flight from Indianapolis                      3
a flight from Indianapolis to Houston           2
a flight from Indianapolis to Houston on TWA    1


Figure 10.14 Reduplicated effort caused by backtracking in top-down
parsing.

10.4 The Earley Algorithm


The previous section presented three kinds of problems that afflict
standard bottom-up or top-down parsers, even when they have been
augmented with filtering and other improvements: left-recursive rules,
ambiguity, and inefficient reparsing of subtrees. Luckily, there is a
single class of algorithms which can solve all these problems. Dynamic
programming once again provides a framework for solving them, just as
it helped us with the Minimum Edit Distance, Viterbi, and Forward
algorithms. Recall that dynamic programming approaches systematically
fill in tables of solutions to sub-problems. When complete, the tables
contain the solution to all the sub-problems needed to solve the
problem as a whole. In the case of parsing, such a table is used to
store subtrees for each of the various constituents in the input as
they are discovered. The efficiency gain arises from the fact that
these subtrees are discovered once, stored, and then used in all parses
calling for that constituent. This solves the reparsing problem
(subtrees are looked up, not re-parsed) and the ambiguity problem (the
parsing table implicitly stores all possible parses by storing all the
constituents with links that enable the parses to be reconstructed).
Furthermore, dynamic programming parsing algorithms also solve the
problem of left-recursion. As we discussed earlier, there are three
well-known dynamic programming parsers: the Cocke-Younger-Kasami (CYK)
algorithm, which we will present in Chapter 12; the
Graham-Harrison-Ruzzo (GHR) algorithm (Graham et al., 1980); and the
Earley algorithm (Earley, 1970), which we will introduce in the
remainder of this chapter.
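
To make the table-filling idea concrete before turning to Earley, here
is a small Python sketch that counts the parses of a noun phrase with
trailing PPs by memoizing the result for each span. It is only an
illustration of the dynamic programming principle, not one of the three
parsers named above; the two-rule grammar (NP → NP PP | Noun,
PP → Prep NP) and the function name are assumptions for this example.

```python
from functools import lru_cache


def count_np_parses(tags):
    """Count parses of `tags` as an NP under NP -> NP PP | Noun, PP -> Prep NP.

    Returns the parse count and the number of distinct NP spans that were
    actually computed; each span is solved once and then looked up.
    """
    tags = tuple(tags)

    @lru_cache(maxsize=None)
    def np(i, j):
        # Ways in which tags[i:j] can be an NP.
        total = 1 if (j - i == 1 and tags[i] == "Noun") else 0
        # NP -> NP PP: try every split point between the inner NP and the PP.
        for k in range(i + 1, j):
            total += np(i, k) * pp(k, j)
        return total

    @lru_cache(maxsize=None)
    def pp(i, j):
        # Ways in which tags[i:j] can be a PP (a Prep followed by an NP).
        if j - i >= 2 and tags[i] == "Prep":
            return np(i + 1, j)
        return 0

    return np(0, len(tags)), np.cache_info().currsize


tags = ["Noun"] + ["Prep", "Noun"] * 8   # a noun followed by eight PPs
parses, subproblems = count_np_parses(tags)
print(parses, subproblems)
```

The number of parses grows with the Catalan numbers, but the number of
stored sub-problems is bounded by the number of spans, which is
quadratic in the length of the input.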

The Earley algorithm (Earley, 1970) uses a dynamic programming
approach to efficiently implement a parallel top-down search of the
kind discussed in Section 10.1. As with many dynamic programming
solutions, this algorithm reduces an apparently exponential-time
problem to a polynomial-time one by eliminating the repetitive solution
of sub-problems inherent in backtracking approaches. In this case, the
dynamic programming approach leads to a worst-case behavior of O(N^3),
where N is the number of words in the input.

The core of the Earley algorithm is a single left-to-right pass that
fills an array called a chart that has N + 1 entries. For each word
position in the sentence, the chart contains a list of states
representing the partial parse trees that have been generated so far.
By the end of the sentence, the chart compactly encodes all the
possible parses of the input. Each possible subtree is represented only
once and can thus be shared by all the parses that need it.

The individual states contained within each chart entry contain three
kinds of information: a subtree corresponding to a single grammar rule,
information about the progress made in completing this subtree, and the
position of the subtree with respect to the input. Graphically, we will
use a dot within the right-hand side of a state’s grammar rule to
indicate the progress made in recognizing it. The resulting structure
is called a dotted rule. A state’s position with respect to the input
will be represented by two numbers indicating where the state begins
and where its dot lies. Consider the following three example states,
which would be among those created by the Earley algorithm in the
course of parsing (10.7):

(10.7) Book that flight. (same as (10.1))

S → • VP, [0,0]
NP → Det • Nominal, [1,2]
VP → V NP •, [0,3]

The first state, with its dot to the left of its constituent,
represents a top-down prediction for this particular kind of S. The
first 0 indicates that the constituent predicted by this state should
begin at the start of the input; the second 0 reflects the fact that
the dot lies at the beginning as well. The second state, created at a
later stage in the processing of this sentence, indicates that an NP
begins at position 1, that a Det has been successfully parsed, and that
a Nominal is expected next. The third state, with its dot to the right
of both its constituents, represents the successful discovery of a tree
corresponding to a VP that spans the entire input. These states can
also be represented graphically, in which the states of the parse are
edges, or arcs, and the chart as a whole is a directed acyclic graph,
as in Figure 10.15.
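
One possible in-memory representation of such a state is sketched below
in Python. The class and field names (State, lhs, rhs, dot, start, end)
are our own choices for illustration and are not prescribed by the
algorithm; the three example states from the text are rebuilt with it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class State:
    lhs: str                # left-hand side of the grammar rule
    rhs: Tuple[str, ...]    # right-hand side symbols
    dot: int                # number of rhs symbols recognized so far
    start: int              # input position where the constituent begins
    end: int                # input position where the dot lies

    def is_complete(self) -> bool:
        """True when the dot has reached the right end of the rule."""
        return self.dot == len(self.rhs)

    def next_cat(self) -> Optional[str]:
        """Category immediately to the right of the dot, if any."""
        return None if self.is_complete() else self.rhs[self.dot]

    def __str__(self) -> str:
        symbols = list(self.rhs)
        symbols.insert(self.dot, "•")
        return f"{self.lhs} → {' '.join(symbols)}, [{self.start},{self.end}]"


examples = [
    State("S", ("VP",), 0, 0, 0),
    State("NP", ("Det", "Nominal"), 1, 1, 2),
    State("VP", ("V", "NP"), 2, 0, 3),
]
for state in examples:
    print(state)
```

Making the class frozen keeps states hashable and comparable by value,
which is convenient for the duplicate check performed when states are
added to a chart entry.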

The fundamental operation of an Earley parser is to march through the
N + 1 sets of states in the chart in a left-to-right fashion,
processing the states within each set in order. At each step, one of
the three operators described below is applied to each state depending
on its status. In each case, this results in the addition of new states
to the end of either the current or next set of states in the chart.
The algorithm always moves forward through the chart making additions
as it goes; states are never removed and the algorithm never backtracks
to a previous chart entry once it has moved on. The presence of a state
S → α •, [0,N] in the list of states in the last chart entry indicates
a successful parse. Figure 10.16 gives the complete algorithm.

The following three sections describe in detail the three operators
used to process states in the chart. Each takes a single state as input
and derives new states from it. These new states are then added to the
chart as long as they are not already present. The PREDICTOR and the
COMPLETER add states to the chart entry being processed, while the
SCANNER adds a state to the next chart entry.

Predictor 

As might be guessed from its name, the job of the PREDICTOR is to
create new states representing top-down expectations generated during
the parsing process. The PREDICTOR is applied to any state that has a
non-terminal to the right of the dot that is not a part-of-speech
category. This application results in the creation of one new state for
each alternative expansion of that non-terminal provided by the
grammar. These new states are placed into the same chart entry as the
generating state. They begin and end at the point in the input where
the generating state ends.

For example, applying the PREDICTOR to the state S → • VP, [0,0]
results in adding the states VP → • Verb, [0,0] and VP → • Verb NP,
[0,0] to the first chart entry.

Scanner 

When a state has a part-of-speech category to the right of the dot, the
SCANNER is called to examine the input and incorporate a state
corresponding to the predicted part-of-speech into the chart. This is
accomplished by creating a new state from the input state with the dot
advanced over the predicted input category. Note that the Earley parser
thus uses top-down input to help disambiguate part-of-speech
ambiguities; only those parts-of-speech of a word that are predicted by
some state will find their way into the chart.


function EARLEY-PARSE(words, grammar) returns chart

  ENQUEUE((γ → • S, [0,0]), chart[0])
  for i ← from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and
           NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and
           NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return(chart)

procedure PREDICTOR((A → α • B β, [i,j]))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j]), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j]))
  if B ⊂ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j], [j,j+1]), chart[j+1])

procedure COMPLETER((B → γ •, [j,k]))
  for each (A → α • B β, [i,j]) in chart[j] do
    ENQUEUE((A → α B • β, [i,k]), chart[k])
  end

procedure ENQUEUE(state, chart-entry)
  if state is not already in chart-entry then
    PUSH(state, chart-entry)
  end


Figure 10.16 The Earley algorithm.
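
The pseudocode of Figure 10.16 can be transcribed into a short runnable
Python sketch. This is one possible rendering, not a definitive
implementation: states are plain tuples (lhs, rhs, dot, start, end),
the grammar and lexicon are the small ones used for Book that flight,
and the names earley_recognize and accepts are ours. We assume that the
state enqueued by the Scanner has its dot after the word, which is how
such states appear in Figure 10.17.

```python
GRAMMAR = {
    "S": [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
    "NP": [("Det", "Nominal"), ("Proper-Noun",)],
    "Nominal": [("Noun",), ("Noun", "Nominal")],
    "VP": [("Verb",), ("Verb", "NP")],
}
LEXICON = {"book": {"Verb", "Noun"}, "that": {"Det"}, "flight": {"Noun"}}
POS = {"Aux", "Det", "Noun", "Proper-Noun", "Verb"}


def earley_recognize(words, grammar=GRAMMAR, lexicon=LEXICON):
    """Return the chart: a list of len(words) + 1 lists of states."""
    chart = [[] for _ in range(len(words) + 1)]

    def enqueue(state, k):
        if state not in chart[k]:
            chart[k].append(state)

    enqueue(("γ", ("S",), 0, 0, 0), 0)          # dummy start state
    for i in range(len(words) + 1):
        for state in chart[i]:                   # chart[i] may grow as we go
            lhs, rhs, dot, start, end = state
            if dot < len(rhs) and rhs[dot] not in POS:        # PREDICTOR
                for expansion in grammar[rhs[dot]]:
                    enqueue((rhs[dot], expansion, 0, end, end), end)
            elif dot < len(rhs):                               # SCANNER
                if end < len(words) and rhs[dot] in lexicon.get(words[end], ()):
                    enqueue((rhs[dot], (words[end],), 1, end, end + 1), end + 1)
            else:                                              # COMPLETER
                for l2, r2, d2, s2, _ in chart[start]:
                    if d2 < len(r2) and r2[d2] == lhs:
                        enqueue((l2, r2, d2 + 1, s2, end), end)
    return chart


def accepts(words):
    """True if a complete S spanning the whole input is in the last entry."""
    final = earley_recognize(words)[-1]
    return any(lhs == "S" and dot == len(rhs) and start == 0
               for lhs, rhs, dot, start, _ in final)


print(accepts(["book", "that", "flight"]))
```

Because this sketch also advances the dummy start state when an S is
completed, its chart entries can contain a state for γ that a
hand-built chart would omit.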

Returning to our example, when the state VP → • Verb NP, [0,0] is
processed, the SCANNER consults the current word in the input since the
category following the dot is a part-of-speech. The SCANNER then notes
that book can be a verb, matching the expectation in the current state.
This results in the creation of the new state VP → Verb • NP, [0,1].
The new state is then added to the chart entry that follows the one
currently being processed.

Completer

The  Completer  is  applied  to  a state  when  its  dot  has  reached  the  right 
end  of  the  rule.  Intuitively,  the  presence  of  such  a state  represents  the  fact 
that  the  parser  has  successfully  discovered  a particular  grammatical  category 
over  some  span  of  the  input.  The  purpose  of  the  COMPLETER  is  to  find  and 
advance  all  previously  created  states  that  were  looking  for  this  grammatical 
category  at  this  position  in  the  input.  New  states  are  then  created  by  copying 
the  older  state,  advancing  the  dot  over  the  expected  category  and  installing 
the  new  state  in  the  current  chart  entry. 

For example, when the state NP → Det Nominal •, [1,3] is processed,
the COMPLETER looks for states ending at 1 expecting an NP. In the
current example, it will find the state VP → Verb • NP, [0,1] created
by the SCANNER. This results in the addition of a new complete state
VP → Verb NP •, [0,3].

An  Example 

Figure 10.17 shows the sequence of states created during the complete
processing of Example 10.1/10.7. The algorithm begins by seeding the
chart with a top-down expectation for an S. This is accomplished by
adding a dummy state γ → • S, [0,0] to Chart[0]. When this state is
processed, it is passed to the PREDICTOR, leading to the creation of
the three states representing predictions for each possible type of S,
and transitively to states for all of the left corners of those trees.
When the state VP → • Verb, [0,0] is processed, the SCANNER is called
and the first word is consulted. A state representing the verb sense of
Book is then added to the entry for Chart[1]. Note that when the state
VP → • Verb NP, [0,0] is processed, the SCANNER is called again.
However, this time a new state is not added since it would be identical
to the one already in the chart. Note also that since this admittedly
deficient grammar generates no predictions for the Noun sense of Book,
no entries will be made for it in the chart.

When all the states of Chart[0] have been processed, the algorithm
moves on to Chart[1] where it finds the state representing the verb
sense of book. This is a complete state with its dot to the right of
its constituent and is therefore passed to the COMPLETER. The COMPLETER
then finds the two previously existing VP states expecting a Verb at
this point in the input. These states are copied with their dots
advanced and added to Chart[1]. The completed state corresponding to an
intransitive VP leads to the creation of the imperative S state.
Alternatively, the dot in the transitive verb phrase leads to the
creation of the two states predicting NPs. Finally, the state
NP → • Det Nominal, [1,1] causes the SCANNER to consult the word that
and add a corresponding state to Chart[2].

Chart[0]
γ → • S                      [0,0]   Dummy start state
S → • NP VP                  [0,0]   Predictor
NP → • Det NOMINAL           [0,0]   Predictor
NP → • Proper-Noun           [0,0]   Predictor
S → • Aux NP VP              [0,0]   Predictor
S → • VP                     [0,0]   Predictor
VP → • Verb                  [0,0]   Predictor
VP → • Verb NP               [0,0]   Predictor

Chart[1]
Verb → book •                [0,1]   Scanner
VP → Verb •                  [0,1]   Completer
S → VP •                     [0,1]   Completer
VP → Verb • NP               [0,1]   Completer
NP → • Det NOMINAL           [1,1]   Predictor
NP → • Proper-Noun           [1,1]   Predictor

Chart[2]
Det → that •                 [1,2]   Scanner
NP → Det • NOMINAL           [1,2]   Completer
NOMINAL → • Noun             [2,2]   Predictor
NOMINAL → • Noun NOMINAL     [2,2]   Predictor

Chart[3]
Noun → flight •              [2,3]   Scanner
NOMINAL → Noun •             [2,3]   Completer
NOMINAL → Noun • NOMINAL     [2,3]   Completer
NP → Det NOMINAL •           [1,3]   Completer
VP → Verb NP •               [0,3]   Completer
S → VP •                     [0,3]   Completer
NOMINAL → • Noun             [3,3]   Predictor
NOMINAL → • Noun NOMINAL     [3,3]   Predictor

Figure 10.17 Sequence of states created in chart while parsing Book
that flight. Each entry shows the state, its start and end points, and
the Earley function that placed it in the chart.

Moving on to Chart[2], the algorithm finds the state representing the
determiner sense of that. This complete state leads to the advancement
of the dot in the NP state predicted in Chart[1], and also to the
predictions for the various kinds of Nominal. The first of these causes
the SCANNER to be called for the last time to process the word flight.

Moving on to Chart[3], the presence of the state representing flight
leads in quick succession to the completion of an NP, transitive VP,
and an S. The presence of the state S → VP •, [0,3] in the last chart
entry signals the discovery of a successful parse.


Retrieving  Parse  Trees  from  a Chart 


The version of the Earley algorithm just described is actually a
recognizer, not a parser. After processing, valid sentences will leave
the state S → α •, [0,N] in the chart. Unfortunately, as it stands we
have no way of retrieving the structure of this S. To turn this
algorithm into a parser, we must be able to extract individual parses
from the chart. To do this, the representation of each state must be
augmented with an additional field to store information about the
completed states that generated its constituents.

This  information  can  be  gathered  by  making  a simple  change  to  the 
Completer.  Recall  that  the  Completer  creates  new  states  by  advancing 
older  incomplete  ones  when  the  constituent  following  the  dot  is  discovered. 
The  only  change  necessary  is  to  have  COMPLETER  add  a pointer  to  the  older 
state  onto  the  list  of  previous-states  of  the  new  state.  Retrieving  a parse  tree 
from  the  chart  is  then  merely  a recursive  retrieval  starting  with  the  state  (or 
states)  representing  a complete  S in  the  final  chart  entry.  Figure  10.18  shows 
the  chart  produced  by  an  appropriately  updated  COMPLETER. 
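
A minimal Python sketch of this change follows. It is an illustration
under our own assumptions rather than a definitive implementation: each
state is a tuple whose last field holds the completed child states, the
only difference from a plain recognizer is the extra field filled in by
the Completer, and the names earley_parse, to_tree and parse_trees are
hypothetical. Symbols without an entry in the grammar are treated as
part-of-speech categories. Since the children are part of a state's
identity here, alternative analyses of the same span are kept as
separate states, which is simpler to read but gives up some of the
sharing that a real chart provides.

```python
def earley_parse(words, grammar, lexicon):
    """Fill an Earley chart whose states point to their completed children.

    A state is (lhs, rhs, dot, start, end, children).
    """
    chart = [[] for _ in range(len(words) + 1)]

    def enqueue(state, k):
        if state not in chart[k]:
            chart[k].append(state)

    enqueue(("γ", ("S",), 0, 0, 0, ()), 0)
    for i in range(len(words) + 1):
        for state in chart[i]:                   # the list may grow as we go
            lhs, rhs, dot, start, end, children = state
            if dot < len(rhs) and rhs[dot] in grammar:         # PREDICTOR
                for expansion in grammar[rhs[dot]]:
                    enqueue((rhs[dot], expansion, 0, end, end, ()), end)
            elif dot < len(rhs):                                # SCANNER
                if end < len(words) and rhs[dot] in lexicon.get(words[end], ()):
                    enqueue((rhs[dot], (words[end],), 1, end, end + 1, ()),
                            end + 1)
            else:                                               # COMPLETER
                for l2, r2, d2, s2, _, kids2 in chart[start]:
                    if d2 < len(r2) and r2[d2] == lhs:
                        # Record the completed constituent on the new state.
                        enqueue((l2, r2, d2 + 1, s2, end, kids2 + (state,)), end)
    return chart


def to_tree(state):
    """Recursively follow the child pointers of a complete state."""
    lhs, rhs, _, _, _, children = state
    if not children:                  # lexical state such as Verb → book •
        return (lhs, rhs[0])
    return (lhs,) + tuple(to_tree(child) for child in children)


def parse_trees(words, grammar, lexicon):
    """Trees for every complete S spanning the input in the last entry."""
    final = earley_parse(words, grammar, lexicon)[-1]
    return [to_tree(s) for s in final
            if s[0] == "S" and s[2] == len(s[1]) and s[3] == 0]


GRAMMAR = {
    "S": [("NP", "VP"), ("Aux", "NP", "VP"), ("VP",)],
    "NP": [("Det", "Nominal"), ("Proper-Noun",)],
    "Nominal": [("Noun",), ("Noun", "Nominal")],
    "VP": [("Verb",), ("Verb", "NP")],
}
LEXICON = {"book": {"Verb", "Noun"}, "that": {"Det"}, "flight": {"Noun"}}

trees = parse_trees(["book", "that", "flight"], GRAMMAR, LEXICON)
print(trees)
```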

If there are an exponential number of trees for a given sentence, the
Earley algorithm cannot magically return them all in a polynomial
amount of time. The best it can do is build the chart in polynomial
time. Figure 10.19 illustrates a portion of the chart from Figure 10.17
using the directed graph notation. Note that since large charts in this
format can get rather confusing, this figure only includes the states
that play a role in the final parse.

Chart[0]
S0   γ → • S                    [0,0]  []         Dummy start state
S1   S → • NP VP                [0,0]  []         Predictor
S2   NP → • Det NOMINAL         [0,0]  []         Predictor
S3   NP → • Proper-Noun         [0,0]  []         Predictor
S4   S → • Aux NP VP            [0,0]  []         Predictor
S5   S → • VP                   [0,0]  []         Predictor
S6   VP → • Verb                [0,0]  []         Predictor
S7   VP → • Verb NP             [0,0]  []         Predictor

Chart[1]
S8   Verb → book •              [0,1]  []         Scanner
S9   VP → Verb •                [0,1]  [S8]       Completer
S10  S → VP •                   [0,1]  [S9]       Completer
S11  VP → Verb • NP             [0,1]  [S8]       Completer
S12  NP → • Det NOMINAL         [1,1]  []         Predictor
S13  NP → • Proper-Noun         [1,1]  []         Predictor

Chart[2]
S14  Det → that •               [1,2]  []         Scanner
S15  NP → Det • NOMINAL         [1,2]  [S14]      Completer
S16  NOMINAL → • Noun           [2,2]  []         Predictor
S17  NOMINAL → • Noun NOMINAL   [2,2]  []         Predictor

Chart[3]
S18  Noun → flight •            [2,3]  []         Scanner
S19  NOMINAL → Noun •           [2,3]  [S18]      Completer
S20  NOMINAL → Noun • NOMINAL   [2,3]  [S18]      Completer
S21  NP → Det NOMINAL •         [1,3]  [S14,S19]  Completer
S22  VP → Verb NP •             [0,3]  [S8,S21]   Completer
S23  S → VP •                   [0,3]  [S22]      Completer
S24  NOMINAL → • Noun           [3,3]  []         Predictor
S25  NOMINAL → • Noun NOMINAL   [3,3]  []         Predictor


Figure 10.18 Sequence of states created in chart while parsing Book
that flight including structural information.


Figure 10.19 A portion of the chart shown in Figure 10.17 displayed in
a directed acyclic graph notation.


10.5  Finite-State  Parsing  Methods 


Some language-processing tasks don’t require complete parses. For these
tasks, a partial parse or shallow parse of the input sentence may be
sufficient. For example, information extraction algorithms generally do
not extract all the possible information in a text; they simply extract
enough to fill out some sort of template of required data. Many partial
parsing systems use cascades of finite-state automata instead of
context-free grammars. Relying on simple finite-state automata rather
than full parsing makes such systems extremely efficient. Since
finite-state systems cannot model certain kinds of recursive rules,
however, they trade this efficiency for a certain lack of coverage. We
will discuss information extraction in Chapter 15; here we just show
how finite-state automata can be used to recognize basic phrases, such
as noun groups, verb groups, locations, etc. Here’s the output of the
FASTUS basic phrase identifier; of course the choice of which basic
phrases to produce can be dependent on the application:



Company Name:   Bridgestone Sports Co.
Verb Group:     said
Noun Group:     Friday
Noun Group:     it
Verb Group:     had set up
Noun Group:     a joint venture
Preposition:    in
Location:       Taiwan
Preposition:    with
Noun Group:     a local concern
Conjunction:    and
Noun Group:     a Japanese trading house
Verb Group:     to produce
Noun Group:     golf clubs
Verb Group:     to be shipped
Preposition:    to
Location:       Japan

These basic phrases are produced by a collection of finite-state rules
compiled into a transducer. To give a feel for how this works, we’ll
give a simplified set of the FASTUS rules from Appelt and Israel (1997)
used to build the automaton to detect noun groups. A noun group is like
the core of a noun phrase; it consists of the head noun and the
modifiers to the left (determiner, adjectives, quantifiers, numbers,
etc.).

A noun-group can consist of just a pronoun (she, him, them), a
time-phrase (yesterday), or a date:

NG → Pronoun | Time-NP | Date-NP

It can also consist of certain determiners that can stand alone (this,
that); or a head noun (HdNns) preceded by an optional determiner phrase
(DETP) and/or optional adjectives (Adjs) (the quick and dirty solution,
the frustrating mathematics problem); or a head noun modified by a
gerund phrase (the rising index):

NG → (DETP) (Adjs) HdNns | DETP Ving HdNns | DETP-CP (and HdNns)

The  parentheses  above  are  used  to  indicate  optional  elements,  while 
braces  are  used  just  for  grouping.  Determiner-phrases  come  in  two  varieties: 

DETP → DETP-CP | DETP-INCP

Complete determiner-phrases (DETP-CP) are those which can stand alone
as an NP, such as the only five, another three, this, many, hers, all,
and the most. Adv-pre-num are adverbs that can appear before a number
in the determiner (almost 5, precisely 5), while Pro-Poss-cp are
possessive pronouns that can stand on their own as complete NPs (mine,
his). Quantifiers (Q) include many, few, much, etc.

DETP-CP → ( { Adv-pre-num | “another” |
{ Det | Pro-Poss } ( { Adv-pre-num | “only” (“other”) } ) } ) Number
| Q | Q-er | (“the”) Q-est | “another” | Det-cp | DetQ | Pro-Poss-cp

Incomplete  determiner-phrases  ( DETP-INCP ) are  those  which  cannot 
act  as  NPs  alone,  for  example  the,  his  only,  every,  a.  Pro-Poss-incomp  are 
possessive  pronouns  which  cannot  stand  on  their  own  as  a complete  NP  (e.g. 
my,  her): 

DETP-INCP → { { { Det | Pro-Poss } “only”

| “a”  | “an” 

| Det-incomp 

| Pro-Poss-incomp  } (“other”) 

| (DET-CP)  “other”} 

An adjective sequence (Adjs) consists of one or more adjectives or
participles separated by commas and/or conjunctions (e.g., big, bad,
and ugly, or interesting but outdated):

Adjs → AdjP ( { “,” | (“,”) Conj } { AdjP | Vparticiple } )*

Adjective phrases can be made of adjectives, participles, ordinal
numbers, and noun-verb combinations, like man-eating, and can be
modified by comparative and superlative quantifiers (Q-er: more, fewer;
Q-est: most, fewest). This rule-set chooses to disallow participles as
the first word in adjective-phrases or noun groups, to avoid
incorrectly taking many Verb-Object combinations as noun groups.

AdjP → Ordinal
| ({Q-er | Q-est}) { Adj | Vparticiple } +
| { N[sing,!Time-NP] (“-”) { Vparticiple }
| Number (“-”) { “month” | “day” | “year” } (“-”) “old” }

Nouns can be conjoined (cats and dogs):

HdNns → HdNn (“and” HdNn)

Finally, we need to deal with noun-noun compounds and other noun-like
pre-modifiers of nouns, in order to cover head noun groups like
gasoline and oil tanks, California wines, Clinton, and quick-reaction
strike:

HdNn → PropN
| { PreNs | PropN PreNs } N[!Time-NP]
| { PropN CommonN[!Time-NP] }

Noun modifiers of nouns can be conjoined (gasoline and oil) or created
via dash (quick-reaction). Adj-noun-like refers to adjectives that can appear
in the position of a prenominal noun (e.g. presidential retreat):

PreNs → PreN (“and” PreN2)*
PreN → (Adj “-”) Common-Sing-N
PreN2 → PreN | Ordinal | Adj-noun-like

Figure  10.20  shows  an  FSA  for  the  Adjs  portion  of  the  noun-group 
recognizer,  and  an  FSA  for  the  AdjP  portion. 




Figure  10.20  A portion  of  an  FSA  grammar,  covering  conjoined  adjective 
phrases.  In  a real  automaton,  each  AdjP  node  would  actually  be  expanded  with 
a copy  of  the  AdjP  automaton  shown  in  Figure  10.21 




Figure  10.21  A portion  of  an  FSA  grammar,  covering  the  internal  details 
of  adjective  phrases. 


The pieces of automata in Figure 10.20 and Figure 10.21 can then
be combined into a single large Noun-Group-Recognizer by starting with
the NG automaton and iteratively expanding out each reference to another
rule/automaton. This is only possible because none of these references are
recursive; that is, because the expansion of AdjP doesn’t refer to AdjP.
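The expansion step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three rules below are a simplified, hypothetical fragment over part-of-speech tags (not the FASTUS rule set), each rule is written as a regular expression in which `<Name>` stands for a reference to another rule, and the helper names `expand` and `is_noun_group` are our own. Because no rule refers to itself, substitution terminates and yields one flat pattern, which is equivalent to a single finite-state recognizer.

```python
import re

# Hypothetical toy rules over part-of-speech tags (not the FASTUS rule set).
# Each terminal matches a tag followed by one space; <Name> refers to another rule.
RULES = {
    "NG": r"(?:Det )?(?:<Adjs>)?N ",
    "Adjs": r"<AdjP>(?:(?:, |(?:, )?Conj )<AdjP>)*",
    "AdjP": r"Ordinal |(?:Q-er |Q-est )?(?:Adj )+",
}


def expand(symbol, rules, active=()):
    """Replace every <Name> reference by the (expanded) body of that rule."""
    if symbol in active:
        # A rule that reaches itself can never be expanded into a finite pattern.
        raise ValueError(f"recursive reference to {symbol}")

    def substitute(match):
        return "(?:" + expand(match.group(1), rules, active + (symbol,)) + ")"

    return re.sub(r"<(\w+)>", substitute, rules[symbol])


NG_PATTERN = re.compile(expand("NG", RULES))


def is_noun_group(tags):
    """Return True if the whole tag sequence is accepted by the expanded pattern."""
    return NG_PATTERN.fullmatch("".join(tag + " " for tag in tags)) is not None


print(is_noun_group(["Det", "Adj", ",", "Adj", "Conj", "Adj", "N"]))  # True
print(is_noun_group(["Det", "Conj", "N"]))                            # False
```

Calling `expand` on a rule set in which a rule refers to itself (for example a rule NP whose body mentions `<NP>`) raises an error instead of looping, which mirrors the restriction just described.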

Page 345, however, showed that a more complete grammar of English
requires this kind of recursion. Recall, for example, that a complete definition
of NP needs to refer to other NPs in the rules for relative clauses and
other post-nominal modifiers.

One way to handle recursion is by allowing only a limited amount
of recursion; this is what FASTUS does, by using its automata cascade.
The second level of FASTUS finds non-recursive noun groups; the third
level combines these groups into larger NP-like units by adding on measure
phrases:

20,000 iron and “metal wood” clubs a month,

attaching preposition phrases:

production of 20,000 iron and “metal wood” clubs a month,

and dealing with noun group conjunction:

a local concern and a Japanese trading house

In a single level system, each of these phenomena would require recursive
rules (e.g. NP → NP and NP). By splitting the parsing into two levels,
FASTUS essentially treats the NP on the left-hand side as a different kind of
object from the NPs on the right-hand side.

A second method for dealing with recursion is to use a model which
looks finite-state but isn’t. One such model is the Recursive Transition
Network or RTN. An RTN is defined by a set of graphs like those in
Figure 10.20 and Figure 10.21, in which each arc contains a terminal or
non-terminal node. The difference between an RTN and an FSA lies in how the
non-terminals are handled. In an RTN, every time the machine comes to an
arc labeled with a non-terminal, it treats that non-terminal as a sub-routine.
It places its current location onto a stack, jumps to the non-terminal, and
then jumps back when that non-terminal has been parsed. If a rule for NP
contains a self-reference, the RTN once again puts the current location on a
stack and jumps back to the beginning of the NP.

Since an RTN is exactly equivalent to a context-free grammar, traversing
an RTN can thus be thought of as a graphical way to view a simple
top-down parser for context-free rules. RTNs are most often used as a
convenient graphical metaphor when displaying or describing grammars, or as
a way to implement a system which has a small amount of recursion but is
otherwise finite-state.
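The sub-routine behavior of an RTN can be sketched as follows. The two networks (an NP that may end in a PP, and a PP that contains an NP), the dictionary layout, and the function name `rtn_accepts` are all assumptions made for this illustration. The sketch keeps an explicit return stack, as described above, and explores alternatives with a simple agenda; like any naive top-down method it would loop on a left-recursive network.

```python
# Hypothetical networks: arcs are (label, next_state); a label that names
# another network is a non-terminal, anything else is a terminal tag.
NETWORKS = {
    "NP": {"start": 0, "final": {2, 3},
           "arcs": {0: [("Det", 1)], 1: [("N", 2)], 2: [("PP", 3)]}},
    "PP": {"start": 0, "final": {2},
           "arcs": {0: [("P", 1)], 1: [("NP", 2)]}},
}


def rtn_accepts(networks, start, tokens):
    """Return True if the network `start` accepts the whole token list."""
    # A configuration is (network, state, input position, return stack).
    # The return stack records where to resume when a sub-network finishes.
    agenda = [(start, networks[start]["start"], 0, ())]
    while agenda:
        net, state, pos, stack = agenda.pop()
        if state in networks[net]["final"]:
            if stack:
                # Sub-network done: pop the saved location and jump back.
                ret_net, ret_state = stack[-1]
                agenda.append((ret_net, ret_state, pos, stack[:-1]))
            elif pos == len(tokens):
                return True
        for label, nxt in networks[net]["arcs"].get(state, []):
            if label in networks:
                # Non-terminal arc: save the resume point, jump to the sub-network.
                agenda.append((label, networks[label]["start"], pos,
                               stack + ((net, nxt),)))
            elif pos < len(tokens) and tokens[pos] == label:
                agenda.append((net, nxt, pos + 1, stack))
    return False


print(rtn_accepts(NETWORKS, "NP", ["Det", "N", "P", "Det", "N"]))  # True
print(rtn_accepts(NETWORKS, "NP", ["Det", "N", "P"]))              # False
```

Note that the NP network is reached again from inside the PP network, which is exactly the kind of self-reference that the expanded finite-state automata cannot express.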


10.6  Summary 

This chapter introduced a lot of material. The most important two ideas are
those of parsing and partial parsing. Here’s a summary of the main points
we covered about these ideas:

• Parsing can be viewed as a search problem.

• Two common architectural metaphors for this search are top-down
(starting with the root S and growing trees down to the input words)
and bottom-up (starting with the words and growing trees up toward
the root S).

• One simple parsing algorithm is the top-down depth-first left-to-right
parser of Figure 10.6 on page 362.

• Top-down parsers can be made more efficient by using a left-corner
table to only suggest non-terminals which are compatible with the input.

• Ambiguity, left-recursion, and repeated parsing of sub-trees all pose
problems for this simple parsing algorithm.

• A sentence is structurally ambiguous if the grammar assigns it more
than one possible parse.

• Common kinds of structural ambiguity include PP-attachment,
coordination ambiguity and noun-phrase bracketing ambiguity.

• The dynamic programming parsing algorithms use a table of partial
parses to efficiently parse ambiguous sentences. The Earley algorithm
is a top-down dynamic-programming algorithm, while the CYK algorithm
is bottom-up.

• Certain information extraction problems can be solved without full
parsing. These are often addressed via FSA cascades.


Bibliographical  and  Historical  Notes 


Writing about the history of compilers, Knuth notes:

In this field there has been an unusual amount of parallel discovery
of the same technique by people working independently.

Indeed, the problem of identifying the first appearance of various parsing
ideas recalls Kruskal’s (1983) comment about the ‘remarkable history of
multiple independent discovery and publication’ of dynamic programming
algorithms for sequence comparison. This history will therefore err on
the side of succinctness in giving only a characteristic early mention of each
algorithm; the interested reader should see Aho and Ullman (1972).

Bottom-up parsing seems to have been first described by Yngve (1955),
who gave a breadth-first bottom-up parsing algorithm as part of an illustration
of a machine translation procedure. Top-down approaches to parsing
and translation were described (presumably independently) by at least
Glennie (1960), Irons (1961), and Kuno and Oettinger (1962). Dynamic
programming parsing, once again, has a history of independent discovery.
According to Martin Kay (p.c.), a dynamic programming parser containing the
roots of the CYK algorithm was first implemented by John Cocke in 1960.
Later work extended and formalized the algorithm, as well as proving its
time complexity (Kay, 1967; Younger, 1967; Kasami, 1965). The related
well-formed substring table (WFST) seems to have been independently
proposed by Kuno (1965), as a data structure which stores the results of all
previous computations in the course of the parse. Based on a generalization
of Cocke’s work, a similar data structure had been independently described
by Kay (1967) and Kay (1973). The top-down application of dynamic
programming to parsing was described in Earley’s Ph.D. thesis (Earley, 1968)
and Earley (1970). Sheil (1976) showed the equivalence of the WFST and
the Earley algorithm. Norvig (1991) shows that the efficiency offered by
all of these dynamic programming algorithms can be captured in any
language with a memoization function (such as LISP) simply by wrapping the
memoization operation around a simple top-down parser.

While parsing via cascades of finite-state automata had been common
in the early history of parsing (Harris, 1962), the focus shifted to full
CFG parsing quite soon afterwards. Church (1980) argued for a return to
finite-state grammars as a processing model for natural language
understanding. Other early finite-state parsing models include Ejerhed (1988).
Abney (1991) argued for the important practical role of shallow parsing. Much
recent work on shallow parsing applies machine learning to the task of learning
the patterns; see for example Ramshaw and Marcus (1995), Shlomo Argamon
(1998), and Munoz et al. (1999).

The classic reference for parsing algorithms is Aho and Ullman (1972);
although the focus of that book is on computer languages, most of the
algorithms have been applied to natural language. A good programming
languages textbook such as Aho et al. (1986) is also useful.


Exercises 

10.1  Modify  the  top-down  parser  in  Figure  10.7  to  add  bottom-up  filtering. 
You  can  assume  the  use  of  a left-corner  table  like  the  one  on  page  366. 

10.2 Write an algorithm for eliminating left-recursion based on the
intuition on page 368.

10.3 Implement the finite-state grammar for noun-groups described on pages
384-387. Test it on some sample noun-phrases. If you have access to an
on-line dictionary with part-of-speech information, start with that; if not, build
a more restricted system by hand.

10.4 Augment the Earley algorithm of Figure 10.16 to enable parse trees
to be retrieved from the chart by modifying the pseudocode for the
Completer as described on page 381.

10.5 Implement the Earley algorithm of Figure 10.16 as augmented in the
previous exercise. Check it on a test sentence using a baby grammar.

10.6 Discuss the relative advantages and disadvantages of partial parsing
versus full parsing.

10.7 Discuss how you would augment a parser to deal with input that may
be incorrect, such as spelling errors or misrecognitions from a speech
recognition system.


FEATURES  AND 
UNIFICATION 


FRIAR  FRANCIS:  If  either  of  you  know  any  inward  impediment 
why  you  should  not  be  conjoined,  charge  you,  on  your  souls, 
to  utter  it. 


William  Shakespeare,  Much  Ado  About  Nothing 


From a reductionist perspective, the history of the natural sciences over the
last few hundred years can be seen as an attempt to explain the behavior of
larger structures by the combined action of smaller primitives. In biology,
the properties of inheritance have been explained by the action of genes,
and then again the properties of genes have been explained by the action of
DNA. In physics, matter was reduced to atoms and then again to subatomic
particles. The appeal of reductionism has not escaped computational
linguistics. In this chapter we introduce the idea that grammatical categories
like VPto, Sthat, Non3sgAux, or 3sgNP, as well as the grammatical rules like
S → NP VP that make use of them, should be thought of as objects that can
have complex sets of properties associated with them. The information in
these properties is represented by constraints, and so these kinds of models
are often called constraint-based formalisms.

Why do we need a more fine-grained way of representing and placing
constraints on grammatical categories? One problem arose in Chapter 9,
where we saw that naive models of grammatical phenomena such as agreement
and subcategorization can lead to overgeneration problems. For example,
in order to avoid ungrammatical noun phrases such as this flights and
verb phrases like disappeared a flight, we were forced to create a huge
proliferation of primitive grammatical categories such as Non3sgVPto, NPmass,
3sgNP and Non3sgAux. These new categories led, in turn, to an explosion in
the number of grammar rules and a corresponding loss of generality in the
grammar. A constraint-based representation scheme will allow us to represent
fine-grained information about number and person, agreement,
subcategorization, as well as semantic categories like mass/count.

Constraint-based  formalisms  have  other  advantages  that  we  will  not 
cover  in  this  chapter,  such  as  the  ability  to  model  more  complex  phenomena 
than  context-free  grammars,  and  the  ability  to  efficiently  and  conveniently 
compute  semantics  for  syntactic  representations. 

Consider briefly how this approach might work in the case of grammatical
number. As we saw in Chapter 9, noun phrases like this flight and those
flights can be distinguished based on whether they are singular or plural.
This distinction can be captured if we associate a property called NUMBER,
which can have the value singular or plural, with appropriate members of the
NP category. Given this ability, we can say that this flight is a member of the
NP category and, in addition, has the value singular for its NUMBER
property. This same property can be used in the same way to distinguish singular
and plural members of the VP category such as serves lunch and serve lunch.

Of course, simply associating these properties with various words and
phrases does not solve any of our overgeneration problems. To make these
properties useful, we need the ability to perform simple operations, such as
equality tests, on them. By pairing such tests with our core grammar rules,
we can add various constraints to help ensure that only grammatical strings
are generated by the grammar. For example, we might want to ask whether
or not a given noun phrase and verb phrase have the same values for their
respective number properties. Such a test is illustrated by the following kind
of rule.

S → NP VP

Only if the number of the NP is equal to the number of the VP.
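As a rough illustration of pairing an equality test with a rule, the following sketch represents each constituent as a plain Python dictionary. The dictionary keys (`CAT`, `NUMBER`, `WORDS`) and the function name `apply_s_rule` are assumptions made for this example only; they anticipate, in a very simplified form, the feature structures introduced in the rest of the chapter.

```python
def apply_s_rule(np, vp):
    """Build an S from an NP and a VP, but only if their NUMBER values are equal."""
    if np["CAT"] != "NP" or vp["CAT"] != "VP":
        return None
    if np["NUMBER"] != vp["NUMBER"]:
        return None  # the agreement test fails, so the rule does not apply
    return {"CAT": "S", "WORDS": np["WORDS"] + vp["WORDS"]}


this_flight = {"CAT": "NP", "NUMBER": "singular", "WORDS": ["this", "flight"]}
those_flights = {"CAT": "NP", "NUMBER": "plural", "WORDS": ["those", "flights"]}
serves_lunch = {"CAT": "VP", "NUMBER": "singular", "WORDS": ["serves", "lunch"]}

print(apply_s_rule(this_flight, serves_lunch))    # an S covering all four words
print(apply_s_rule(those_flights, serves_lunch))  # None: the numbers differ
```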

The remainder of this chapter provides the details of one computational
implementation of a constraint-based formalism, based on feature
structures and unification. The next section describes feature structures, the
representation used to capture the kind of grammatical properties we have in
mind. Section 11.2 then introduces the unification operator that is used to
implement basic operations over feature structures. Section 11.3 then
covers the integration of these structures into a grammatical formalism. Section
11.4 then introduces the unification algorithm and its required data
structures. Next, Section 11.5 describes how feature structures and the
unification operator can be integrated into a parser. Finally, Section 11.6 discusses
the most significant extension to this constraint-based formalism, the use of
types and inheritance, as well as other extensions.


11.1  Feature  Structures 

One of the simplest ways to encode the kind of properties that we have in
mind is through the use of feature structures. These are simply sets of
feature-value pairs, where features are unanalyzable atomic symbols drawn
from some finite set, and values are either atomic symbols or feature
structures. Such feature structures are traditionally illustrated with the following
kind of matrix-like diagram.

[ FEATURE_1  VALUE_1
  FEATURE_2  VALUE_2
  ...
  FEATURE_n  VALUE_n ]

To be concrete, let us consider the number property discussed above.
To capture this property, we will use the symbol NUMBER to designate this
grammatical attribute, and the symbols SG and PL (introduced in Chapter 3)
to designate the possible values it can take on in English. A simple feature
structure consisting of this single feature would then be illustrated as follows.

[ NUMBER  SG ]

Adding an additional feature-value pair to capture the grammatical notion of
person leads to the following feature structure.

[ NUMBER  SG
  PERSON  3  ]

Next we can encode the grammatical category of the constituent that this
structure corresponds to through the use of the CAT feature. For example,
we can indicate that these features are associated with a noun phrase by
using the following structure.

[ CAT     NP
  NUMBER  SG
  PERSON  3  ]

This structure can be used to represent the 3sgNP category introduced in
Chapter 9 to capture a restricted subcategory of noun phrases. The
corresponding plural version of this structure would be captured as follows.

[ CAT     NP
  NUMBER  PL
  PERSON  3  ]

Note that the values of the CAT and PERSON features remain the same for
these last two structures. This illustrates how the use of feature structures
allows us to both preserve the core set of grammatical categories and draw
distinctions among members of a single category.

As mentioned earlier in the definition of feature structures, features
are not limited to atomic symbols as their values; they can also have other
feature structures as their values. This is particularly useful when we wish
to bundle a set of feature-value pairs together for similar treatment. As an
example of this, consider that the NUMBER and PERSON features are often
lumped together since grammatical subjects must agree with their predicates
in both their number and person properties. This lumping together can be
captured by introducing an AGREEMENT feature that takes a feature
structure consisting of the NUMBER and PERSON feature-value pairs as its value.
Introducing this feature into our third person singular noun phrase yields the
following kind of structure.

[ CAT        NP
  AGREEMENT  [ NUMBER  SG
               PERSON  3  ] ]

Given this kind of arrangement, we can test for the equality of the values for
both the NUMBER and PERSON features of two constituents by testing for
the equality of their AGREEMENT features.

This ability to use feature structures as values leads fairly directly to
the notion of a feature path. A feature path is nothing more than a list of
features through a feature structure leading to a particular value. For
example, in the last feature structure, we can say that the ⟨AGREEMENT NUMBER⟩
path leads to the value SG, while the ⟨AGREEMENT PERSON⟩ path leads to
the value 3. This notion of a path leads naturally to an alternative
graphical way of illustrating feature structures, shown in Figure 11.1, which as
we will see in Section 11.4 is suggestive of how they will be implemented.
In these diagrams, feature structures are depicted as directed graphs where
features appear as labeled edges and values as nodes.

Figure 11.1 A directed graph notation for feature structures.


Although this notion of paths will prove useful in a number of
settings, we introduce it here to help explain an additional important kind of
feature structure: those that contain features that actually share some feature
structure as a value. Such feature structures will be referred to as reentrant
structures. What we have in mind here is not the simple idea that two
features might have equal values, but rather that they share precisely the same
feature structure (or node in the graph). These two cases can be distinguished
clearly if we think in terms of paths through a graph. In the case of simple
equality, two paths lead to distinct nodes in the graph that anchor identical,
but distinct structures. In the case of a reentrant structure, two feature paths
actually lead to the same node in the structure.
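The difference between equal values and shared values can be made concrete with nested Python dictionaries, where a shared value is literally one object reached by two paths. This is only a sketch: the dictionary encoding, the helper names `follow` and `same_node`, and the particular structure used are assumptions for illustration, not the representation developed later in the chapter.

```python
def follow(fs, path):
    """Follow a feature path (a sequence of feature names) to the value it leads to."""
    value = fs
    for feature in path:
        value = value[feature]
    return value


def same_node(fs, p, q):
    """True if the two paths lead to the very same node, not merely to equal values."""
    return follow(fs, p) is follow(fs, q)


# Reentrant: both paths lead to one and the same dictionary object.
agr = {"NUMBER": "SG", "PERSON": 3}
shared = {"CAT": "S",
          "HEAD": {"AGREEMENT": agr, "SUBJECT": {"AGREEMENT": agr}}}

# Merely equal: two distinct objects that happen to hold identical contents.
copied = {"CAT": "S",
          "HEAD": {"AGREEMENT": {"NUMBER": "SG", "PERSON": 3},
                   "SUBJECT": {"AGREEMENT": {"NUMBER": "SG", "PERSON": 3}}}}

P = ("HEAD", "AGREEMENT")
Q = ("HEAD", "SUBJECT", "AGREEMENT")

print(follow(shared, P + ("NUMBER",)))         # SG
print(same_node(shared, P, Q))                 # True: one node, two paths
print(follow(copied, P) == follow(copied, Q))  # True: the values are equal
print(same_node(copied, P, Q))                 # False: but they are distinct nodes
```

One practical consequence of sharing is that information added at the end of one path is automatically visible at the end of the other, which is not the case for the merely equal structures.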

Figure 11.2 illustrates a simple example of reentrancy. In this structure,
the ⟨HEAD SUBJECT AGREEMENT⟩ path and the ⟨HEAD AGREEMENT⟩ path
lead to the same location. Shared structures like this will be denoted in our
matrix diagrams by adding numerical indexes that signal the values to be
shared. The matrix version of the feature structure from Figure 11.2 would
be denoted as follows, using the notation of the PATR-II system (Shieber,
1986), based on Kay (1979):

[ CAT   S
  HEAD  [ AGREEMENT  ① [ NUMBER  SG
                         PERSON  3  ]
          SUBJECT    [ AGREEMENT  ① ] ] ]

Figure 11.2 A feature structure with shared values. The location (value)
found by following the ⟨HEAD SUBJECT AGREEMENT⟩ path is the same as
that found via the ⟨HEAD AGREEMENT⟩ path.


As  we  will  see,  these  simple  structures  give  us  the  ability  to  express 
linguistic  generalizations  in  surprisingly  compact  and  elegant  ways. 


11.2  Unification  of  Feature  Structures 


As noted earlier, feature structures would be of little use without our being
able to perform reasonably efficient and powerful operations on them. As we
will show, the two principal operations we need to perform are merging the
information content of two structures and rejecting the merger of structures
that are incompatible. Fortunately, a single computational technique, called
unification, suffices for both of these purposes. The bulk of this section
will illustrate through a series of examples how unification instantiates these
notions of merger and compatibility. Discussion of the unification algorithm
and its implementation will be deferred to Section 11.4.

We begin with the following simple application of the unification
operator.

[ NUMBER  SG ] ⊔ [ NUMBER  SG ] = [ NUMBER  SG ]

As this equation illustrates, unification is implemented as a binary operator
(represented here as ⊔) that accepts two feature structures as arguments and
returns a feature structure when it succeeds. In this example, unification is
being used to perform a simple equality check. The unification succeeds
because the corresponding NUMBER features in each structure agree as to
their values. In this case, since the original structures are identical, the output
is the same as the input. The following similar kind of check fails since the
NUMBER features in the two structures have incompatible values.


[ NUMBER  SG ] ⊔ [ NUMBER  PL ]   Fails!


This next unification illustrates an important aspect of the notion of
compatibility in unification.

[ NUMBER  SG ] ⊔ [ NUMBER  [] ] = [ NUMBER  SG ]

In this situation, these feature structures are taken to be compatible, and
are hence capable of being merged, despite the fact that the given values for
the respective NUMBER features are different. The [] value in the second
structure indicates that the value has been left unspecified. A feature with
such a [] value can be successfully matched to any value in a corresponding
feature in another structure. Therefore, in this case, the value SG from the
first structure can match the [] value from the second, and as is indicated by
the output shown, the result of this type of unification is a structure with the
value provided by the more specific, non-null, value.

The next example illustrates another of the merger aspects of
unification.

[ NUMBER  SG ] ⊔ [ PERSON  3 ] = [ NUMBER  SG
                                   PERSON  3  ]

Here the result of the unification is a merger of the original two structures
into one larger structure. This larger structure contains the union of all the
information stored in each of the original structures. Although this is a
simple example, it is important to understand why these structures are judged to
be compatible: they are compatible because they contain no features that are
explicitly incompatible. The fact that they each contain a feature-value pair
that the other does not is not a reason for the unification to fail.
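The four examples so far can be reproduced with a short recursive function over nested dictionaries. This is a simplified sketch rather than the algorithm of Section 11.4: it assumes that feature structures are Python dictionaries, that the unspecified value [] is the empty dictionary, that failure is signalled by returning None, and it ignores reentrancy entirely. The function name `unify` is our own choice.

```python
def unify(f, g):
    """Return the unification of f and g, or None if they are incompatible.

    Feature structures are dicts, atomic values are anything else, and the
    unspecified value [] is the empty dict. Shared structure is not handled.
    """
    if f == {}:
        return g  # [] is compatible with anything
    if g == {}:
        return f
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)  # start from a shallow copy of the first argument
        for feature, g_value in g.items():
            if feature in result:
                merged = unify(result[feature], g_value)
                if merged is None:
                    return None  # a clash anywhere makes the whole unification fail
                result[feature] = merged
            else:
                result[feature] = g_value  # information present only in g
        return result
    return f if f == g else None  # atoms (or atom against structure)


print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))  # {'NUMBER': 'SG'}
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))  # None
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))    # {'NUMBER': 'SG'}
print(unify({"NUMBER": "SG"}, {"PERSON": 3}))     # {'NUMBER': 'SG', 'PERSON': 3}
```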

We will now consider a series of cases involving the unification of
somewhat more complex reentrant structures. The following example
illustrates an equality check complicated by the presence of a reentrant structure
in the first argument.

[ AGREEMENT  ① [ NUMBER  SG
                 PERSON  3  ]
  SUBJECT    [ AGREEMENT  ① ] ]

⊔ [ SUBJECT  [ AGREEMENT  [ PERSON  3
                            NUMBER  SG ] ] ]

= [ AGREEMENT  ① [ NUMBER  SG
                   PERSON  3  ]
    SUBJECT    [ AGREEMENT  ① ] ]

The important elements in this example are the SUBJECT features in the two
input structures. The unification of these features succeeds because the
values found in the first argument by following the ① numerical index match
those that are directly present in the second argument. Note that, by itself,
the value of the AGREEMENT feature in the first argument would have no
bearing on the success of unification since the second argument lacks an
AGREEMENT feature at the top level. It only becomes relevant because the
value of the AGREEMENT feature is shared with the SUBJECT feature.

The following example illustrates the copying capabilities of
unification.

(11.1)
[ AGREEMENT  ①
  SUBJECT    [ AGREEMENT  ① ] ]

⊔ [ SUBJECT  [ AGREEMENT  [ PERSON  3
                            NUMBER  SG ] ] ]

= [ AGREEMENT  ①
    SUBJECT    [ AGREEMENT  ① [ PERSON  3
                                NUMBER  SG ] ] ]

Here the value found via the second argument’s ⟨SUBJECT AGREEMENT⟩
path is copied over to the corresponding place in the first argument. In
addition, the AGREEMENT feature of the first argument receives a value as
a side-effect of the index linking it to the value at the end of the
⟨SUBJECT AGREEMENT⟩ path.

The next example demonstrates the important difference between
features that actually share values versus those that merely have similar values.

(11.2)
[ AGREEMENT  [ NUMBER  SG ]
  SUBJECT    [ AGREEMENT  [ NUMBER  SG ] ] ]

⊔ [ SUBJECT  [ AGREEMENT  [ PERSON  3
                            NUMBER  SG ] ] ]

= [ AGREEMENT  [ NUMBER  SG ]
    SUBJECT    [ AGREEMENT  [ NUMBER  SG
                              PERSON  3  ] ] ]

The values at the end of the ⟨SUBJECT AGREEMENT⟩ path and the
⟨AGREEMENT⟩ path are the same, but not shared, in the first argument. The
unification of the SUBJECT features of the two arguments adds the PERSON
information from the second argument to the result. However, since there
is no index linking the AGREEMENT feature to the ⟨SUBJECT AGREEMENT⟩
feature, this information is not added to the value of the AGREEMENT
feature.


Finally, consider the following example of a failure to unify.

[ AGREEMENT  ① [ NUMBER  SG
                 PERSON  3  ]
  SUBJECT    [ AGREEMENT  ① ] ]

⊔ [ AGREEMENT  [ NUMBER  SG
                 PERSON  3  ]
    SUBJECT    [ AGREEMENT  [ NUMBER  PL
                              PERSON  3  ] ] ]

Fails!

Proceeding through the features in order, we first find that the AGREEMENT
features in these examples successfully match. However, when we move
on to the SUBJECT features, we find that the values found at the end of the
respective ⟨SUBJECT AGREEMENT NUMBER⟩ paths differ, causing a
unification failure.

Feature structures are a way of representing partial information about
some linguistic object or placing informational constraints on what the object
can be. Unification can be seen as a way of merging the information in each
feature structure, or describing objects which satisfy both sets of constraints.
Intuitively, unifying two feature structures produces a new feature structure
which is more specific (has more information) than, or is identical to, either
of the input feature structures. We say that a less specific (more abstract)
feature structure subsumes an equally or more specific one. Subsumption
is represented by the operator ⊑. A feature structure F subsumes a feature
structure G (F ⊑ G) if and only if:

1. for every feature x in F, F(x) ⊑ G(x) (where F(x) means ‘the value
of the feature x of feature structure F’)

2. for all paths p and q in F such that F(p) = F(q), it is also the case that
G(p) = G(q).

For example, consider these feature structures:

(11.3)  [ NUMBER  SG ]

(11.4)  [ PERSON  3 ]

(11.5)  [ NUMBER  SG
          PERSON  3  ]

(11.6)  [ CAT        VP
          AGREEMENT  ①
          SUBJECT    [ AGREEMENT  ① ] ]

(11.7)  [ CAT        VP
          AGREEMENT  ①
          SUBJECT    [ AGREEMENT  [ PERSON  3
                                    NUMBER  SG ] ] ]

(11.8)  [ CAT        VP
          AGREEMENT  ①
          SUBJECT    [ AGREEMENT  ① [ PERSON  3
                                      NUMBER  SG ] ] ]

The following subsumption relations hold among them:

11.3 ⊑ 11.5
11.4 ⊑ 11.5
11.6 ⊑ 11.7 ⊑ 11.8

Subsumption is a partial ordering; there are pairs of feature structures
that neither subsume nor are subsumed by each other:

11.3 ⋢ 11.4
11.4 ⋢ 11.3
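The two conditions of the definition can be checked directly on small structures. The following is a minimal sketch under several assumptions of our own: feature structures are nested Python dictionaries, [] is the empty dictionary, sharing is modelled by object identity, the structures are acyclic, and the helper names (`paths`, `follow`, `subsumes`) are hypothetical. It tests only the simple structures (11.3) to (11.5) plus a small reentrant pair.

```python
def paths(fs, prefix=()):
    """Yield (path, value) for every path through a nested-dict feature structure."""
    yield prefix, fs
    if isinstance(fs, dict):
        for feature, value in fs.items():
            yield from paths(value, prefix + (feature,))


def follow(fs, path):
    for feature in path:
        fs = fs[feature]
    return fs


def condition_1(f, g):
    """Every feature of f is present in g with a value that f's value subsumes."""
    if isinstance(f, dict):
        if not f:
            return True  # [] subsumes everything
        return isinstance(g, dict) and all(
            x in g and condition_1(value, g[x]) for x, value in f.items())
    return f == g


def subsumes(f, g):
    """True if f subsumes g: condition 1, plus every sharing in f also holds in g."""
    if not condition_1(f, g):
        return False
    first_path = {}
    for path, value in paths(f):
        if isinstance(value, dict):
            earlier = first_path.setdefault(id(value), path)
            if earlier != path:  # two paths of f share this node (condition 2)
                a, b = follow(g, earlier), follow(g, path)
                shared_in_g = a is b if isinstance(a, dict) else a == b
                if not shared_in_g:
                    return False
    return True


fs_11_3 = {"NUMBER": "SG"}
fs_11_4 = {"PERSON": 3}
fs_11_5 = {"NUMBER": "SG", "PERSON": 3}

print(subsumes(fs_11_3, fs_11_5), subsumes(fs_11_4, fs_11_5))  # True True
print(subsumes(fs_11_3, fs_11_4), subsumes(fs_11_4, fs_11_3))  # False False
```

The second condition matters only for reentrant structures: a structure whose AGREEMENT and ⟨SUBJECT AGREEMENT⟩ values are shared does not subsume one in which the two values are merely equal copies.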

Since every feature structure is subsumed by the empty structure [],
the relation among feature structures can be defined as a semilattice. The
semilattice is often represented pictorially with the most general feature []
at the top and the subsumption relation represented by lines between feature
structures. Unification can be defined in terms of the subsumption
semilattice. Given two feature structures F and G, F ⊔ G is defined as the most
general feature structure H s.t. F ⊑ H and G ⊑ H. Since the information
ordering defined by unification is a semilattice, the unification operation is
monotonic (Pereira and Shieber, 1984; Rounds and Kasper, 1986; Moshier,
1988). This means that if some description is true of a feature structure,
unifying it with another feature structure results in a feature structure that
still satisfies the original description. The unification operation is therefore
order-independent; given a set of feature structures to unify, we can check
them in any order and get the same result. Thus in the above example we
could instead have chosen to check the AGREEMENT attribute first and the
unification still would have failed.

To summarize, unification is a way of implementing the integration of
knowledge from different constraints. Given two compatible feature
structures as input, it produces the most general feature structure which
nonetheless contains all the information in the inputs. Given two incompatible
feature structures, it fails.
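
To make this summary concrete, here is a minimal, non-destructive sketch of unification in Python. It rests on assumptions of our own: feature structures are nested dictionaries, failure is signaled by returning `None`, and reentrancy is not modeled. The helper `unify_all` is likewise hypothetical and exists only to illustrate order independence.

```python
from itertools import permutations


def unify(f, g):
    """Return the most general structure containing all information in f and g.

    Returns None when the inputs are incompatible. Assumes nested dicts
    without reentrancy; the inputs themselves are never modified.
    """
    if f == {}:                  # the empty structure [] adds no information
        return g
    if g == {}:
        return f
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)         # shallow copy, so f is left untouched
        for feature, g_value in g.items():
            if feature in result:
                merged = unify(result[feature], g_value)
                if merged is None:       # a clash anywhere fails the whole unification
                    return None
                result[feature] = merged
            else:
                result[feature] = g_value
        return result
    return f if f == g else None  # atoms unify only with identical atoms


def unify_all(structures):
    """Unify a sequence of structures left to right, stopping at the first failure."""
    result = {}
    for fs in structures:
        result = unify(result, fs)
        if result is None:
            return None
    return result


number_sg = {"AGREEMENT": {"NUMBER": "SG"}}
person_3 = {"AGREEMENT": {"PERSON": 3}}
number_pl = {"AGREEMENT": {"NUMBER": "PL"}}

print(unify(number_sg, person_3))    # {'AGREEMENT': {'NUMBER': 'SG', 'PERSON': 3}}
print(unify(number_sg, number_pl))   # None

# Order independence: every ordering of an incompatible set fails.
print(all(unify_all(p) is None
          for p in permutations([number_sg, person_3, number_pl])))   # True
```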

11.3  Feature Structures in the Grammar

Our primary purpose in introducing feature structures and unification has
been to provide a way to elegantly express syntactic constraints that would
be difficult to express using the mechanisms of context-free grammars alone.
Our next step, therefore, is to specify a way to integrate feature structures
and unification operations into the specification of a grammar. This can be
accomplished by augmenting the rules of ordinary context-free grammars
with attachments that specify feature structures for the constituents of the
rules, along with appropriate unification operations that express constraints
on those constituents. From a grammatical point of view, these attachments
will be used to accomplish the following goals:

• To  associate  complex  feature  structures  with  both  lexical  items  and 
instances  of  grammatical  categories. 

• To  guide  the  composition  of  feature  structures  for  larger  grammatical 
constituents  based  on  the  feature  structures  of  their  component  parts. 

• To enforce compatibility constraints between specified parts of
grammatical constructions.

We will use the following notation to denote the grammar augmentations
that will allow us to accomplish all of these goals, based on the PATR-II
system described in Shieber (1986):

β₀ → β₁ ⋯ βₙ
    {set of constraints}

The specified constraints have one of the following forms.

⟨βᵢ feature path⟩ = Atomic value
⟨βᵢ feature path⟩ = ⟨βⱼ feature path⟩

The notation ⟨βᵢ feature path⟩ denotes a feature path through the feature
structure associated with the βᵢ component of the context-free part of the
rule. The first style of constraint specifies that the value found at the end
of the given path must unify with the specified atomic value. The second
form specifies that the values found at the end of the two given paths must
be unifiable.

To  illustrate  the  use  of  these  constraints,  let  us  return  to  the  informal 
solution  to  the  number  agreement  problem  proposed  at  the  beginning  of  this 
chapter. 

S → NP VP
    Only if the number of the NP is equal to the number of the VP.

Using the new notation, this rule can now be expressed as follows.

S → NP VP
    ⟨NP NUMBER⟩ = ⟨VP NUMBER⟩
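
One way to picture such an augmented rule is as a context-free part plus a list of path equations. The Python sketch below is only illustrative: the dictionary layout, the tuple encoding of paths, and the names `follow`, `value_of`, and `satisfies` are all assumptions of ours, and the check handles only atomic values (an undefined path is treated as unconstrained).

```python
def follow(fs, path):
    """Return the value at the end of a feature path, or None if it is undefined."""
    for feature in path:
        if not isinstance(fs, dict) or feature not in fs:
            return None
        fs = fs[feature]
    return fs


def value_of(term, constituents):
    # A tuple such as ("NP", "NUMBER") encodes the path <NP NUMBER>;
    # anything else is taken to be an atomic value.
    if isinstance(term, tuple):
        return follow(constituents[term[0]], term[1:])
    return term


def satisfies(constraints, constituents):
    """Check each equation; an undefined value is compatible with anything."""
    for left, right in constraints:
        a, b = value_of(left, constituents), value_of(right, constituents)
        if a is not None and b is not None and a != b:
            return False
    return True


# S -> NP VP with <NP NUMBER> = <VP NUMBER>
S_RULE = {
    "lhs": "S",
    "rhs": ["NP", "VP"],
    "constraints": [(("NP", "NUMBER"), ("VP", "NUMBER"))],
}

this_flight = {"NUMBER": "SG"}
these_flights = {"NUMBER": "PL"}
serves_breakfast = {"NUMBER": "SG"}

print(satisfies(S_RULE["constraints"], {"NP": this_flight, "VP": serves_breakfast}))     # True
print(satisfies(S_RULE["constraints"], {"NP": these_flights, "VP": serves_breakfast}))   # False
```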
Note that in cases where there are two or more constituents of the same
syntactic category in a rule, we will subscript the constituents to keep them
straight, as in VP → V NP₁ NP₂.

Taking  a step  back  from  the  notation,  it  is  important  to  note  that  in 
this  approach  the  simple  generative  nature  of  context-free  rules  has  been 
fundamentally  changed  by  this  augmentation.  Ordinary  context-free  rules 
are based on the simple notion of concatenation; an NP followed by a VP
is  an  S,  or  generatively,  to  produce  an  S all  we  need  to  do  is  concatenate  an 
NP  to  a VP.  In  the  new  scheme,  this  concatenation  must  be  accompanied  by 
a successful  unification  operation.  This  leads  naturally  to  questions  about 
the  computational  complexity  of  the  unification  operation  and  its  effect  on 
the  generative  power  of  this  new  grammar.  These  issues  will  be  discussed  in 
detail  in  Chapter  13. 

To review, there are two fundamental components to this approach.

• The elements of context-free grammar rules will have feature-based
constraints associated with them. This reflects a shift from atomic
grammatical categories to more complex categories with properties.

• The constraints associated with individual rules can refer to, and
manipulate, the feature structures associated with the parts of the rule to
which they are attached.

The  following  sections  present  applications  of  unification  constraints 
to  four  interesting  linguistic  phenomena:  agreement,  grammatical  heads, 
subcategorization,  and  long  distance  dependencies. 

Agreement 

As  discussed  in  Chapter  9,  agreement  phenomena  show  up  in  a number 
of  different  places  in  English.  This  section  illustrates  how  unification  can 
be  used  to  capture  the  two  main  types  of  English  agreement  phenomena: 
subject-verb  agreement  and  determiner-nominal  agreement.  We  will  use  the 
following ATIS sentences as examples throughout this discussion to
illustrate these phenomena.

(11.9)  This flight serves breakfast.

(11.10)  Does this flight serve breakfast?

(11.11)  Do these flights serve breakfast?

Notice that the constraint used to enforce SUBJECT-VERB agreement
given above is deficient in that it ignores the PERSON feature. The following
constraint, which makes use of the AGREEMENT feature, takes care of this
problem.

S → NP VP
    ⟨NP AGREEMENT⟩ = ⟨VP AGREEMENT⟩

Examples 11.10 and 11.11 illustrate a minor variation on SUBJECT-VERB
agreement. In these Yes-No questions, the subject NP must agree
with the auxiliary verb, rather than the main verb of the sentence, which
appears in a non-finite form. This agreement constraint can be handled by
the following rule.

S → Aux NP VP
    ⟨Aux AGREEMENT⟩ = ⟨NP AGREEMENT⟩

Agreement between determiners and nominals in noun phrases is
handled in a similar fashion. The basic task is to allow the forms given above,
but block the unwanted *this flights and *those flight forms where the
determiners and nominals clash in their NUMBER feature. Again, the logical
place to enforce this constraint is in the grammar rule that brings the parts
together.

NP → Det Nominal
    ⟨Det AGREEMENT⟩ = ⟨Nominal AGREEMENT⟩
    ⟨NP AGREEMENT⟩ = ⟨Nominal AGREEMENT⟩

This rule states that the AGREEMENT feature of the Det must unify with
the AGREEMENT feature of the Nominal, and moreover, that the AGREEMENT
feature of the NP is constrained to be the same as that of the Nominal.

Having expressed the constraints needed to enforce subject-verb and
determiner-nominal agreement, we must now fill in the rest of the machinery
needed to make these constraints work. Specifically, we must consider how
the various constituents that take part in these constraints (the Aux, VP, NP,
Det, and Nominal) acquire values for their various agreement features.

We can begin by noting that our constraints involve both lexical and
non-lexical constituents. The simpler lexical constituents, Aux and Det,
receive values for their respective agreement features directly from the lexicon
as in the following rules.

Aux → do
    ⟨Aux AGREEMENT NUMBER⟩ = PL
    ⟨Aux AGREEMENT PERSON⟩ = 3

Aux → does
    ⟨Aux AGREEMENT NUMBER⟩ = SG
    ⟨Aux AGREEMENT PERSON⟩ = 3

Determiner → this
    ⟨Determiner AGREEMENT NUMBER⟩ = SG

Determiner → these
    ⟨Determiner AGREEMENT NUMBER⟩ = PL

Returning to our first S rule, let us first consider the AGREEMENT
feature for the VP constituent. The constituent structure for this VP is specified
by the following rule.

VP → Verb NP

It seems clear that the agreement constraint for this constituent must
be based on its constituent verb. This verb, as with the previous lexical
entries, can acquire its agreement feature values directly from the lexicon as in
the following rules.

Verb → serve
    ⟨Verb AGREEMENT NUMBER⟩ = PL

Verb → serves
    ⟨Verb AGREEMENT NUMBER⟩ = SG
    ⟨Verb AGREEMENT PERSON⟩ = 3

All that remains is to stipulate that the agreement feature of the parent VP is
constrained to be the same as that of its verb constituent.

VP → Verb NP
    ⟨VP AGREEMENT⟩ = ⟨Verb AGREEMENT⟩

In other words, non-lexical grammatical constituents can acquire values for
at least some of their features from their component constituents.

The same technique works for the remaining NP and Nominal
categories. The values for the agreement features for these categories are derived
from the nouns flight and flights.

Noun → flight
    ⟨Noun AGREEMENT NUMBER⟩ = SG

Noun → flights
    ⟨Noun AGREEMENT NUMBER⟩ = PL

Similarly, the Nominal features are constrained to have the same values
as those of its constituent noun, as follows.

Nominal → Noun
    ⟨Nominal AGREEMENT⟩ = ⟨Noun AGREEMENT⟩
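
The determiner-nominal machinery can be traced end to end in a small Python sketch. It is a simplification under stated assumptions: the `LEXICON` dictionary, the function names, and the use of `None` as a failure signal are ours, and agreement structures are kept flat (only atomic feature values), so a simple dictionary merge stands in for full unification.

```python
# Hypothetical lexicon mirroring the lexical rules in the text.
LEXICON = {
    "this":    {"CAT": "Det",  "AGREEMENT": {"NUMBER": "SG"}},
    "these":   {"CAT": "Det",  "AGREEMENT": {"NUMBER": "PL"}},
    "flight":  {"CAT": "Noun", "AGREEMENT": {"NUMBER": "SG"}},
    "flights": {"CAT": "Noun", "AGREEMENT": {"NUMBER": "PL"}},
}


def unify_agreement(a, b):
    """Merge two flat agreement structures; return None if any feature clashes."""
    merged = dict(a)
    for feature, value in b.items():
        # A feature missing from `merged` is unconstrained, so it cannot clash.
        if merged.get(feature, value) != value:
            return None
        merged[feature] = value
    return merged


def build_np(det_word, noun_word):
    """Apply Nominal -> Noun, then NP -> Det Nominal, enforcing agreement."""
    det, noun = LEXICON[det_word], LEXICON[noun_word]
    # <Nominal AGREEMENT> = <Noun AGREEMENT>
    nominal = {"CAT": "Nominal", "AGREEMENT": noun["AGREEMENT"]}
    # <Det AGREEMENT> = <Nominal AGREEMENT>
    agreement = unify_agreement(det["AGREEMENT"], nominal["AGREEMENT"])
    if agreement is None:
        return None
    # <NP AGREEMENT> = <Nominal AGREEMENT>
    return {"CAT": "NP", "AGREEMENT": agreement}


print(build_np("this", "flight"))    # {'CAT': 'NP', 'AGREEMENT': {'NUMBER': 'SG'}}
print(build_np("this", "flights"))   # None
```

The second call fails for exactly the reason the rule intends: the NUMBER values of the determiner and the nominal clash.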

Note that this section has only scratched the surface of the English
agreement system, and that the agreement systems of other languages can be
considerably more complex than that of English.

Head  Features 

To account for the way compositional grammatical constituents such as noun
phrases, nominals, and verb phrases come to have agreement features, the
preceding section introduced the notion of copying needed feature structures
from children to their parents. This use turns out to be a specific instance
of a much more general phenomenon in constraint-based grammars.
Specifically, the features for most grammatical categories are copied from one of
the children to the parent. The child that provides the features is called the
head of the phrase, and the features copied are referred to as head features.

To  make  this  clear,  consider  the  following  three  rules  from  the  last 
section. 

VP → Verb NP
    ⟨VP AGREEMENT⟩ = ⟨Verb AGREEMENT⟩

NP → Det Nominal
    ⟨Det AGREEMENT⟩ = ⟨Nominal AGREEMENT⟩
    ⟨NP AGREEMENT⟩ = ⟨Nominal AGREEMENT⟩

Nominal → Noun
    ⟨Nominal AGREEMENT⟩ = ⟨Noun AGREEMENT⟩

In  each  of  these  rules,  the  constituent  providing  the  agreement  feature 
structure  up  to  the  parent  is  the  head  of  the  phrase.  More  specifically,  the 
verb  is  the  head  of  the  verb  phrase,  the  nominal  is  the  head  of  the  noun 
phrase,  and  the  noun  is  the  head  of  the  nominal.  In  addition,  we  can  say  that 
the  agreement  feature  structure  is  a head  feature.  We  can  rewrite  our  rules  to 
reflect  these  generalizations  by  placing  the  agreement  feature  structure  under 
a HEAD  feature  and  then  copying  that  feature  upward  as  in  the  following 
constraints. 
VP → Verb NP                                                    (11.12)
    ⟨VP HEAD⟩ = ⟨Verb HEAD⟩

NP → Det Nominal                                                (11.13)
    ⟨NP HEAD⟩ = ⟨Nominal HEAD⟩
    ⟨Det HEAD AGREEMENT⟩ = ⟨Nominal HEAD AGREEMENT⟩

Nominal → Noun                                                  (11.14)
    ⟨Nominal HEAD⟩ = ⟨Noun HEAD⟩

Similarly, the lexical rules that introduce these features must now
reflect this HEAD notion, as in the following.

Noun → flights
    ⟨Noun HEAD AGREEMENT NUMBER⟩ = PL

Verb → serves
    ⟨Verb HEAD AGREEMENT NUMBER⟩ = SG
    ⟨Verb HEAD AGREEMENT PERSON⟩ = 3
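
The effect of passing a HEAD feature upward can be sketched in Python by letting each parent hold a reference to the very same HEAD object as its head child. This is an illustration only: nested dictionaries stand in for feature structures, the function name `project` is hypothetical, and the determiner is omitted for brevity.

```python
noun_flights = {"CAT": "Noun", "HEAD": {"AGREEMENT": {"NUMBER": "PL"}}}


def project(category, head_child):
    """Build a parent whose HEAD is the same object as its head child's HEAD."""
    return {"CAT": category, "HEAD": head_child["HEAD"]}


nominal = project("Nominal", noun_flights)   # Nominal -> Noun
np = project("NP", nominal)                  # NP -> Det Nominal (Det omitted here)

print(np["HEAD"]["AGREEMENT"]["NUMBER"])     # PL
print(np["HEAD"] is noun_flights["HEAD"])    # True
```

Because the HEAD value is shared rather than duplicated, the agreement information introduced by the noun is visible at every level of the noun phrase.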

The notion of a head is an extremely significant one in grammar,
because it provides a way for a syntactic rule to be linked to a particular word.
In this way heads will play an important role in the dependency grammars
and lexicalized grammars of Chapter 12, and the head transducers
mentioned in Chapter 21.

Subcategorization 

Recall that subcategorization is the notion that verbs can be picky about the
patterns of arguments they will allow themselves to appear with. In
Chapter 9, to prevent the generation of ungrammatical sentences with verbs and
verb phrases that do not match, we were forced to split the category of verb
into multiple sub-categories. These more specific verbs were then used in
the definition of the specific verb phrases that they were allowed to occur
with, as in the following rule.

Verb-with-S-comp → think
VP → Verb-with-S-comp S

Clearly, this approach introduces exactly the same undesirable
proliferation of categories that we saw with the similar approach to solving the
number problem. The proper way to avoid this proliferation is to introduce
feature structures to distinguish among the various members of the verb
category. This goal can be accomplished by associating an atomic feature called
SUBCAT, with an appropriate value, with each of the verbs in the lexicon.
For example, the transitive version of serves could be assigned the following
feature structure in the lexicon.

Verb → serves
    ⟨Verb HEAD AGREEMENT NUMBER⟩ = SG
    ⟨Verb HEAD SUBCAT⟩ = TRANS

The SUBCAT feature is a signal to the rest of the grammar that this verb
should only appear in verb phrases with a single noun phrase argument. This
constraint is enforced by adding corresponding constraints to all the verb
phrase rules in the grammar, as in the following.

VP → Verb
    ⟨VP HEAD⟩ = ⟨Verb HEAD⟩
    ⟨VP HEAD SUBCAT⟩ = INTRANS

VP → Verb NP
    ⟨VP HEAD⟩ = ⟨Verb HEAD⟩
    ⟨VP HEAD SUBCAT⟩ = TRANS

VP → Verb NP NP
    ⟨VP HEAD⟩ = ⟨Verb HEAD⟩
    ⟨VP HEAD SUBCAT⟩ = DITRANS

The first unification constraint in these rules states that the verb phrase
receives its HEAD features from its verb constituent, while the second
constraint specifies what the value of that SUBCAT feature must be. Any attempt
to use a verb with an inappropriate verb phrase will fail since the value of the
SUBCAT feature of the VP will fail to unify with the atomic symbol given in
the second constraint. Note that this approach requires unique symbols for each
of the 50 to 100 verb phrase frames in English.

This is a somewhat opaque approach since these unanalyzable SUBCAT
symbols do not directly encode either the number or type of the arguments
that the verb expects to take. To see this, note that one cannot simply
examine a verb’s entry in the lexicon and know what its subcategorization frame
is. Rather, you must use the value of the SUBCAT feature indirectly as a
pointer to those verb phrase rules in the grammar that can accept the verb in
question.
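
The indirection involved in atomic SUBCAT symbols shows up clearly in a small Python sketch. The tables `VP_RULES` and `VERBS` and the function `build_vp` are hypothetical names of our own, and `None` is an assumed failure signal; the point is only that the symbol TRANS means nothing until it is looked up against the verb phrase rules.

```python
# Each VP rule is keyed by the categories that follow the verb and
# records the atomic SUBCAT symbol that the rule demands.
VP_RULES = {
    (): "INTRANS",             # VP -> Verb
    ("NP",): "TRANS",          # VP -> Verb NP
    ("NP", "NP"): "DITRANS",   # VP -> Verb NP NP
}

VERBS = {
    "serves": {"HEAD": {"AGREEMENT": {"NUMBER": "SG"}, "SUBCAT": "TRANS"}},
}


def build_vp(verb_word, complements):
    """Return a VP structure, or None if no rule accepts this verb here."""
    verb = VERBS[verb_word]
    required = VP_RULES.get(tuple(complements))
    # <VP HEAD SUBCAT> must unify with the rule's atomic symbol.
    if required is None or verb["HEAD"]["SUBCAT"] != required:
        return None
    # <VP HEAD> = <Verb HEAD>
    return {"CAT": "VP", "HEAD": verb["HEAD"]}


print(build_vp("serves", ["NP"]) is not None)   # True
print(build_vp("serves", []))                   # None
print(build_vp("serves", ["NP", "NP"]))         # None
```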

A somewhat more elegant solution, which makes better use of the
expressive power of feature structures, allows the verb entries to directly
specify the order and category type of the arguments they require. The following
entry for serves is an example of one such approach, in which the verb’s
subcategory feature expresses a list of its objects and complements.

Verb → serves
    ⟨Verb HEAD AGREEMENT NUMBER⟩ = SG
    ⟨Verb HEAD SUBCAT FIRST CAT⟩ = NP
    ⟨Verb HEAD SUBCAT SECOND⟩ = END

This entry uses the FIRST feature to state that the first post-verbal
argument must be an NP; the value of the SECOND feature indicates that this
verb expects only one argument. A verb like leave in leave Boston in the
morning, with two arguments, would have the following kind of entry.

Verb → leaves
    ⟨Verb HEAD AGREEMENT NUMBER⟩ = SG
    ⟨Verb HEAD SUBCAT FIRST CAT⟩ = NP
    ⟨Verb HEAD SUBCAT SECOND CAT⟩ = PP
    ⟨Verb HEAD SUBCAT THIRD⟩ = END

This  scheme  is,  of  course,  a rather  baroque  way  of  encoding  a list;  it  is 
also  possible  to  use  the  idea  of  types  defined  in  Section  11.6  to  define  a list 
type  more  cleanly. 

The  individual  verb  phrase  rules  must  now  check  for  the  presence  of 
exactly  the  elements  specified  by  their  verb,  as  in  the  following  transitive 
rule. 


VP → Verb NP                                                    (11.15)
    ⟨VP HEAD⟩ = ⟨Verb HEAD⟩
    ⟨VP HEAD SUBCAT FIRST CAT⟩ = ⟨NP CAT⟩
    ⟨VP HEAD SUBCAT SECOND⟩ = END

The second constraint in this rule states that the category
of the first element of the verb’s SUBCAT list must match the category of the
constituent immediately following the verb. The third constraint goes on to
state that this verb phrase rule expects only a single argument.
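
The list-valued SUBCAT encoding can be explored with a short Python sketch. The assumptions are ours: nested dictionaries model the feature structures, `subcat_frame` and `vp_accepts` are hypothetical helper names, and a single generic check stands in for the individual verb phrase rules such as (11.15). Only the positions FIRST, SECOND, and THIRD used in the text are supported.

```python
POSITIONS = ["FIRST", "SECOND", "THIRD"]


def subcat_frame(subcat):
    """Decode the FIRST/SECOND/.../END encoding into a plain list of categories."""
    frame = []
    for position in POSITIONS:
        value = subcat.get(position)
        if value == "END":       # END closes the list
            return frame
        if value is None:        # a gap before END is malformed
            break
        frame.append(value["CAT"])
    raise ValueError("SUBCAT list has no END marker within the supported positions")


serves = {"HEAD": {"AGREEMENT": {"NUMBER": "SG"},
                   "SUBCAT": {"FIRST": {"CAT": "NP"}, "SECOND": "END"}}}
leaves = {"HEAD": {"AGREEMENT": {"NUMBER": "SG"},
                   "SUBCAT": {"FIRST": {"CAT": "NP"},
                              "SECOND": {"CAT": "PP"},
                              "THIRD": "END"}}}


def vp_accepts(verb, complement_cats):
    """True if the post-verbal categories are exactly those the verb lists."""
    return subcat_frame(verb["HEAD"]["SUBCAT"]) == list(complement_cats)


print(subcat_frame(leaves["HEAD"]["SUBCAT"]))   # ['NP', 'PP']
print(vp_accepts(serves, ["NP"]))               # True
print(vp_accepts(serves, ["NP", "PP"]))         # False
```

Unlike an atomic symbol such as TRANS, the decoded frame can be read directly off the verb's lexical entry.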
Our previous examples have shown rather simple subcategorization
structures for verbs. In fact, verbs can subcategorize for quite complex
subcategorization frames (e.g., NP PP, NP NP, or NP S), and these frames can
be composed of many different phrasal types. In order to come up with a list
of possible subcategorization frames for English verbs, we first need to have
a list of possible phrase types that can make up these frames. Figure 11.3
shows one short list of possible phrase types for making up subcategorization
frames for verbs; this list is modified from one used to create verb
subcategorization frames in the FrameNet project (Baker et al., 1998), and includes
phrase types for the verb subjects there and it, as well as for objects and
complements.

To use the phrase types in Figure 11.3 in a unification grammar, each
phrase type would have to be described using features. For example, the form
VPto, which is subcategorized for by want, might be expressed as:

Verb → want
    ⟨Verb HEAD SUBCAT FIRST CAT⟩ = VP
    ⟨Verb HEAD SUBCAT FIRST FORM⟩ = INFINITIVE

Each of the 50 to 100 possible verb subcategorization frames in English
would be described as a set drawn from these phrase types. For example,
here is the two-complement want. We use this example to demonstrate two
different notational possibilities. First, lists can be represented via an
angle-bracket notation, ⟨ and ⟩. Second, instead of using a rewrite rule
annotated with path equations, we can represent the lexical entry as a single
feature structure:


[ORTH  WANT
 CAT   VERB
 HEAD  [SUBCAT  ⟨ [CAT  NP],  [CAT   VP
                               HEAD  [VFORM  INFINITIVE]] ⟩]]


Combining  even  a limited  set  of  phrase  types  results  in  a very  large  set 
of  possible  subcategorization  frames.  Furthermore,  each  verb  allows  many 
different  subcategorization  frames.  For  example,  here  are  just  some  of  the 
subcategorization  patterns  for  the  verb  ask,  with  examples  from  the  BNC: 
Subcat       Example
Quo          asked [Quo “What was it like?”]
NP           asking [NP a question]
Swh          asked [Swh what trades you’re interested in]
Sto          ask [Sto him to tell you]
PP           that means asking [PP at home]
Vto          asked [Vto to see a girl called Evelyn]
NP Swheth    asked [NP him] [Swheth whether he could make]
NP NP        asked [NP myself] [NP a question]
NP Swh       asked [NP him] [Swh why he took time off]


Noun Phrase Types
  There     nonreferential there   There is still much to learn...
  It        nonreferential it      It was evident that my ideas...
  NP        noun phrase            As he was relating his story...
Preposition Phrase Types
  PP        preposition phrase     couch their message in terms...
  PPing     gerundive PP           censured him for not having intervened...
  PPpart    particle               turn it off
Verb Phrase Types
  VPbrst    bare stem VP           she could discuss it
  VPto      to-marked infin. VP    Why do you want to know?
  VPwh      Wh- VP                 it is worth considering how to write
  VPing     gerundive VP           I would consider using it
Complement Clause Types
 Finite Clause
  Sfin      finite clause          maintain that the situation was unsatisfactory
  Swh       Wh- clause             ...it tells us where we are...
  Swheth    Whether/if clause      ask whether Aristophanes is depicting a...
 Nonfinite Clause
  Sing      gerundive clause       ...see some attention being given...
  Sto       to-marked clause       know themselves to be relatively unhealthy
  Sforto    for-to clause          She was waiting for him to make some reply...
  Sbrst     bare stem clause       commanded that his sermons be published
Other Types
  AjP       adjective phrase       thought it possible
  Quo       quotes                 asked “What was it like?”

Figure 11.3  A small set of potential phrase types which can be combined
to create a set of potential subcategorization frames for verbs. Modified from
the FrameNet tagset (Baker et al., 1998). The sample sentence fragments are
from the British National Corpus.

A number of comprehensive subcategorization-frame tagsets exist, such
as the COMLEX set (Macleod et al., 1998), which includes subcategorization
frames for verbs, adjectives, and nouns, and the ACQUILEX tagset of
verb subcategorization frames (Sanfilippo, 1993). Many subcategorization-frame
tagsets add other information about the complements, such as
specifying the identity of the subject in a lower verb phrase that has no overt
subject; this is called control information. For example, Temmy promised
Ruth to go (at least in some dialects) implies that Temmy will do the
going, while Temmy persuaded Ruth to go implies that Ruth will do the going.
Some of the multiple possible subcategorization frames for a verb can be
partially predicted by the semantics of the verb; for example, many verbs of
transfer (like give, send, carry) predictably take the two subcategorization
frames NP NP and NP PP:

NP NP  sent FAA Administrator James Busey a letter

NP PP  sent a letter to the chairman of the Armed Services Committee

These relationships between subcategorization frames across classes
of verbs are called argument-structure alternations, and will be discussed
in Chapter 16 when we discuss the semantics of verbal argument structure.
Chapter 12 will introduce probabilities for modeling the fact that verbs
generally have a bias toward which of their possible frames they prefer.

Subcategorization  in  Other  Parts  of  Speech 

Although  the  notion  of  subcategorization,  or  valence  as  it  is  often  called,  was 
originally  designed  for  verbs,  more  recent  work  has  focused  on  the  fact  that 
many  other  kinds  of  words  exhibit  forms  of  valence-like  behavior.  Consider 
the  following  contrasting  uses  of  the  prepositions  while  and  during. 

(11.16)  Keep  your  seatbelt  fastened  while  we  are  taking  off. 

(11.17)  *Keep your seatbelt fastened while takeoff.

(11.18)  Keep  your  seatbelt  fastened  during  takeoff. 

(11.19)  *Keep  your  seatbelt  fastened  during  we  are  taking  off. 

Despite the apparent similarities between these words, they make quite
different demands on their arguments. Representing these differences is left as
Exercise 11.5 for the reader.

Many adjectives and nouns also have subcategorization frames. Here
are some examples using the adjectives apparent, aware, and unimportant
and the nouns assumption and question:



It was apparent [Sfin that the kitchen was the only room...]
It was apparent [PP from the way she rested her hand over his]
aware [Sfin he may have caused offense]
it is unimportant [Swheth whether only a little bit is accepted]
the assumption [Sfin that wasteful methods have been employed]
the question [Swheth whether the authorities might have decided]

See Macleod et al. (1998) for a description of subcategorization frames
for nouns and adjectives.

Verbs express subcategorization constraints on their subjects as well as
their complements. For example, we need to represent the lexical fact that
the verb seem can take an Sfin as its subject (That she was affected seems
obvious), while the verb paint cannot. The SUBJECT feature can be used to
express these constraints.


Long  Distance  Dependencies 


The model of subcategorization we have developed so far has two
components. Each head word has a SUBCAT feature which contains a list of the
complements it expects. Then phrasal rules like the VP rule in (11.15) match
up each expected complement in the SUBCAT list with an actual constituent.
This mechanism works fine when the complements of a verb are in fact to
be found in the verb phrase.

Sometimes, however, a constituent subcategorized for by the verb is
not locally instantiated, but is in a long-distance relationship with the
predicate. Here are some examples of such long-distance dependencies:

What  cities  does  Continental  service? 

What  flights  do  you  have  from  Boston  to  Baltimore? 

What  time  does  that  flight  leave  Atlanta? 



In the first example, the constituent what cities is subcategorized for
by the verb service, but because the sentence is an example of a
wh-non-subject-question, the object is located at the front of the sentence. Recall
from Chapter 9 that a (simple) phrase-structure rule for a
wh-non-subject-question is something like the following:


S → Wh-NP Aux NP VP


Now that we have features, we’ll be able to augment this phrase-structure
rule to require the Aux and the NP to agree (since the NP is the subject).
But we also need some way to augment the rule to tell it that the Wh-NP
should fill some subcategorization slot in the VP. The representation of such
long-distance dependencies is a quite difficult problem, because the verb
whose subcategorization requirement is being filled can be quite distant from
the filler. In the following (made-up) sentence, for example, the wh-phrase
which flight must fill the subcategorization requirements of the verb book,
despite the fact that there are two other verbs (want and have) in between:

Which flight do you want me to have the travel agent book?

Many solutions to representing long-distance dependencies in
unification grammars involve keeping a list, often called a gap list, implemented
as a feature GAP, which is passed up from phrase to phrase in the parse
tree. The filler (for example which flight above) is put on the gap list, and
must eventually be unified with the subcategorization frame of some verb.
See Sag and Wasow (1999) for an explanation of such a strategy, together
with a discussion of the many other complications that must be modeled in
long-distance dependencies.


11.4  Implementing Unification

As discussed, the unification operator takes two feature structures as input
and returns a single merged feature structure if successful, or a failure
signal if the two inputs are not compatible. The input feature structures are
represented as directed acyclic graphs (DAGs), where features are depicted
as labels on directed edges, and feature values are either atomic symbols or
DAGs. As we will see, the implementation of the operator is a relatively
straightforward recursive graph matching algorithm, suitably tailored to
accommodate the various requirements of unification. Roughly speaking, the
algorithm loops through the features in one input and attempts to find a
corresponding feature in the other. If all of the features match, then the unification
is successful. If any single feature causes a mismatch, then the unification
fails. Not surprisingly, the recursion is motivated by the need to correctly
match those features that have feature structures as their values.

One somewhat unusual aspect of the algorithm is that rather than
constructing a new output feature structure containing the unified information
from the two arguments, it destructively alters the arguments so
that in the end they point to exactly the same information. Thus the result
of a successful call to the unification operator consists of suitably altered
versions of the arguments (failed unifications also result in alterations to the
arguments, but more on that later in Section 11.5). As is discussed in the
next section, the destructive nature of this algorithm necessitates certain
minor extensions to the simple graph version of feature structures as DAGs we
have been assuming.


Unification  Data  Structures 


To facilitate the destructive merger aspect of the algorithm, we add a small
complication to the DAGs used to represent the input feature structures:
feature structures are represented using DAGs with additional edges, or
fields. Specifically, each feature structure consists of two fields: a
content field and a pointer field. The content field may be null or contain
an ordinary feature structure. Similarly, the pointer field may be null or
contain a pointer to another feature structure. If the pointer field of the
DAG is null, then the content field of the DAG contains the actual feature
structure to be processed. If, on the other hand, the pointer field is
non-null, then the destination of the pointer represents the actual feature
structure to be processed. Not surprisingly, the merger aspects of
unification are achieved by altering the pointer field of DAGs during
processing.

To make this scheme somewhat more concrete, consider the extended
DAG representation for the following familiar feature structure.


(11.20)
    [ NUMBER  SG
      PERSON  3  ]


The  extended  DAG  representation  is  illustrated  with  our  textual  matrix  dia- 
grams by  treating  the  CONTENT  and  POINTER  fields  as  ordinary  features,  as 
in  the  following  matrix. 


(11.21)
    [ CONTENT  [ NUMBER  [ CONTENTS  SG
                           POINTER   NULL ]
                 PERSON  [ CONTENTS  3
                           POINTER   NULL ] ]
      POINTER  NULL ]


Figure  11.4  shows  this  extended  representation  in  its  graphical  form. 
Note  that  the  extended  representation  contains  content  and  pointer  links  both 
for  the  top-level  layer  of  features,  as  well  as  for  each  of  the  embedded  feature 
structures  all  the  way  down  to  the  atomic  values. 
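
The two-field scheme can be sketched in a few lines of Python. This is only
an illustration under assumptions of our own: the names Node, extend,
dereference, and as_matrix are hypothetical, a null field is modeled as None,
and an ordinary feature structure is modeled as a dictionary from feature
names to nodes.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    """One extended-DAG node: a content field and a pointer field."""
    content: Any = None               # None (null), an atom, or {feature: Node}
    pointer: Optional["Node"] = None  # None (null) or a forwarding link


def extend(value):
    """Wrap a plain nested dict (or atom) in Nodes at every level."""
    if isinstance(value, dict):
        return Node({feat: extend(val) for feat, val in value.items()})
    return Node(value)


def dereference(node):
    """Follow pointer fields until reaching the node holding the real content."""
    while node.pointer is not None:
        node = node.pointer
    return node


def as_matrix(node):
    """Render CONTENT and POINTER as ordinary features, in the style of (11.21)."""
    content = node.content
    if isinstance(content, dict):
        content = {feat: as_matrix(val) for feat, val in content.items()}
    pointer = None if node.pointer is None else as_matrix(node.pointer)
    return {"CONTENT": content, "POINTER": pointer}


fs = extend({"NUMBER": "SG", "PERSON": 3})   # Example 11.20
alias = Node(pointer=fs)                     # null content, non-null pointer
```

Here as_matrix(fs) yields the nested structure of (11.21), with every level
down to the atoms carrying its own pair of fields, while dereference(alias)
returns fs itself, since the real content of alias is found by following its
pointer.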

Figure 11.4 An extended DAG notation for Examples 11.20 and 11.21.


Before going on to the details of the unification algorithm, we will
illustrate the use of this extended DAG representation with the following
simple example. The original extended representations of the arguments to
this unification are shown in Figure 11.5.


(11.22)
    [ NUMBER  SG ]  ⊔  [ PERSON  3 ]  =  [ NUMBER  SG
                                           PERSON  3  ]


Figure 11.5 The original arguments to Example 11.22.

At a high level, we would simply say that the unification results in the
creation  of  a new  structure  containing  the  union  of  the  information  from  the 
two  original  arguments.  With  the  extended  notation,  we  can  see  how  the  uni- 
fication is  accomplished  by  making  some  additions  to  the  original  arguments 
and  changing  some  of  the  pointers  from  one  structure  to  the  other  so  that  in 
the  end  they  contain  the  same  content.  In  this  example,  this  is  accomplished 
by  first  adding  a PERSON  feature  to  the  first  argument,  and  assigning  it  a 
value  by  filling  its  POINTER  field  with  a pointer  to  the  appropriate  location 
in  the  second  argument,  as  shown  in  Figure  11.6. 


Figure 11.6 The arguments after assigning the first argument's new PERSON
feature to the appropriate value in the second argument.

The process is, however, not yet complete. While it is clear from Figure
11.6 that the first argument now contains all the correct information, the
second one does not; it lacks a NUMBER feature. We could, of course, add
a NUMBER feature to this argument with a pointer to the appropriate place
in the first one. This change would result in the two arguments having all
the correct information from this unification. Unfortunately, this solution
is inadequate since it does not meet our requirement that the two arguments
be truly unified. Since the two arguments are not completely unified at the
top level, future unifications involving one of the arguments would not show
up in the other. The solution to this problem is to simply set the POINTER
field of the second argument to point at the first one. When this is done,
any future change to either argument will be immediately reflected in both.
The result of this final change is shown in Figure 11.7.


Figure 11.7 The final result of unifying F1 and F2.


function UNIFY(f1, f2) returns fstructure or failure

  f1-real ← Real contents of f1
  f2-real ← Real contents of f2

  if f1-real is null then
    f1.pointer ← f2
    return f2
  else if f2-real is null then
    f2.pointer ← f1
    return f1
  else if f1-real and f2-real are identical then
    f1.pointer ← f2
    return f2
  else if both f1-real and f2-real are complex feature structures then
    f2.pointer ← f1
    for each feature in f2-real do
      other-feature ← Find or create
                      a feature corresponding to feature in f1-real
      if UNIFY(feature.value, other-feature.value) returns failure then
        return failure
    return f1
  else return failure


Figure 11.8 The Unification Algorithm.


The  Unification  Algorithm 

The unification algorithm that we have been leading up to is shown in Figure
11.8. To review, this algorithm accepts two feature structures represented
using the extended DAG representation. As can be seen from the code, it
may return either one of these arguments as its result. This is, however,
somewhat deceptive since the true effect of this algorithm is the destructive
unification of the two inputs.

The first step in this algorithm is to acquire the true contents of both of
the arguments. Recall that if the pointer field of an extended feature
structure is non-null, then the real content of that structure is found by
following the pointer found in the pointer field. The variables f1-real and
f2-real are the result of this pointer-following process, which is often
referred to as dereferencing.

As with all recursive algorithms, the next step is to test for the various
base cases of the recursion before proceeding on to a recursive call involving
some part of the original arguments. In this case, there are three possible
base cases:

• One or both of the arguments has a null value.

• The arguments are identical.

• The arguments are non-complex and non-identical.

In  the  case  where  either  of  the  arguments  is  null,  the  pointer  field  for 
the  null  argument  is  changed  to  point  to  the  other  argument,  which  is  then 
returned.  The  result  is  that  both  structures  now  point  at  the  same  value. 

If the structures are identical, then the pointer of the first is set to the
second and the second is returned. It is important to understand why this
pointer change is done in this case. After all, since the arguments are
identical, returning either one would appear to suffice. This might be true
for a single unification, but recall that we want the two arguments to the
unification operator to be truly unified. The pointer change is necessary
since we want the arguments to be truly identical, so that any subsequent
unification that adds information to one will add it to both.

If neither of the preceding tests is true then there are two possibilities:
they are non-identical atomic values, or they are non-identical complex
structures. The former case signals an incompatibility in the arguments that
leads the algorithm to return a failure signal. In the latter case, a
recursive call is needed to ensure that the component parts of these complex
structures are compatible. In this implementation, the key to the recursion
is a loop over all the features of the second argument, f2. This loop
attempts to unify the value of each feature in f2 with the corresponding
feature in f1. In this loop, if a feature is encountered in f2 that is
missing from f1, a feature is added to f1 and given the value NULL.
Processing then continues as if the feature had been there to begin with. If
every one of these unifications succeeds, then the pointer field of f2 is set
to f1, completing the unification of the structures, and f1 is returned as
the value of the unification (in Figure 11.8 this pointer assignment is made
just before the loop rather than after it).
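
The steps of Figure 11.8 can be sketched in Python as follows. This is a
minimal illustration rather than a definitive implementation, and it makes
several assumptions of its own: the names Node, make, deref, unify, and
to_plain are hypothetical; failure is signaled by returning None; pointer
fields are set on the dereferenced nodes rather than on the original
arguments; and a guard is added for the case where both arguments already
dereference to the same node, so that a node is never made to point at
itself.

```python
class Node:
    """Extended-DAG node with a content field and a pointer field."""
    def __init__(self, content=None):
        self.content = content   # None (null), an atom, or {feature: Node}
        self.pointer = None      # None (null) or a forwarding link


def make(value):
    """Build Nodes from a plain nested dict; existing Nodes are reused as-is."""
    if isinstance(value, Node):
        return value
    if isinstance(value, dict):
        return Node({feat: make(val) for feat, val in value.items()})
    return Node(value)


def deref(node):
    """Follow pointer fields to the node that holds the real content."""
    while node.pointer is not None:
        node = node.pointer
    return node


def unify(f1, f2):
    """Destructively unify f1 and f2; return the unified node, or None on failure."""
    r1, r2 = deref(f1), deref(f2)
    if r1 is r2:                       # already one node (guard added in this sketch)
        return r1
    if r1.content is None:             # a null argument is forwarded to the other
        r1.pointer = r2
        return r2
    if r2.content is None:
        r2.pointer = r1
        return r1
    c1, c2 = r1.content, r2.content
    if isinstance(c1, dict) and isinstance(c2, dict):
        r2.pointer = r1                # forward f2 to f1, as in Figure 11.8
        for feat, val in c2.items():
            other = c1.setdefault(feat, Node())   # find, or create with null value
            if unify(val, other) is None:
                return None
        return r1
    if not isinstance(c1, dict) and not isinstance(c2, dict) and c1 == c2:
        r1.pointer = r2                # identical atomic values
        return r2
    return None                        # atom clash, or atom against complex


def to_plain(node):
    """Read a node back as a plain nested dict, dereferencing throughout."""
    real = deref(node)
    if isinstance(real.content, dict):
        return {feat: to_plain(val) for feat, val in real.content.items()}
    return real.content


f1, f2 = make({"NUMBER": "SG"}), make({"PERSON": 3})   # Example 11.22
result = unify(f1, f2)
clash = unify(make({"NUMBER": "SG"}), make({"NUMBER": "PL"}))
```

After the first call both f1 and f2 read back as the same two-feature
structure, because f2 now forwards to f1 and the new PERSON feature of f1
forwards to the 3 in f2. The second call returns None, yet its second
argument has already been forwarded to the first, which is the kind of
alteration on failure mentioned earlier.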

We  should  note  that  an  unfortunate  aspect  of  this  algorithm  is  that  it 
is  capable  of  producing  feature  structures  containing  cycles.  This  situation 
can  arise  when  the  algorithm  is  asked  to  unify  a structure  with  a second 
structure  that  contains  the  first  as  a subpart.  The  way  to  avoid  this  situation 
is  to  employ  what  is  called  an  occur  check  (Robinson,  1965).  This  check 
analyzes  the  input  DAGs  and  returns  failure  when  one  of  the  arguments  is 
contained as a subpart of the other. In practice, this check is omitted from
most implementations due to its computational cost.
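
One way such a check might look is sketched below in Python. The names
reachable and passes_occur_check are hypothetical, and the node
representation (a content field that is None, an atom, or a dictionary of
features, plus a pointer field) is an assumption of this sketch. The check
walks each argument looking for the other, which also makes its cost visible:
in the worst case both structures are traversed in full before unification
even begins.

```python
class Node:
    """Extended-DAG node with a content field and a pointer field."""
    def __init__(self, content=None):
        self.content = content   # None, an atom, or {feature: Node}
        self.pointer = None


def deref(node):
    while node.pointer is not None:
        node = node.pointer
    return node


def reachable(start, target):
    """True if target is start itself or lies somewhere below it."""
    target = deref(target)
    stack, seen = [deref(start)], set()
    while stack:
        node = stack.pop()
        if node is target:
            return True
        if id(node) in seen:           # shared substructure: visit only once
            continue
        seen.add(id(node))
        if isinstance(node.content, dict):
            stack.extend(deref(child) for child in node.content.values())
    return False


def passes_occur_check(f1, f2):
    """False when one argument is a proper subpart of the other."""
    r1, r2 = deref(f1), deref(f2)
    if r1 is r2:
        return True
    return not (reachable(r1, r2) or reachable(r2, r1))


agreement = Node({"NUMBER": Node("SG")})
sentence = Node({"SUBJECT": Node({"AGREEMENT": agreement})})
unrelated = Node({"PERSON": Node(3)})
```

With these structures, passes_occur_check(agreement, sentence) is false in
either argument order, since agreement sits inside sentence, while
passes_occur_check(agreement, unrelated) is true.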


An  Example 

To  illustrate  this  algorithm,  let  us  walk  through  the  following  example. 


(11.23)
    [ AGREEMENT  ① [ NUMBER  SG ]      ⊔   [ SUBJECT  [ AGREEMENT  [ PERSON  3 ] ] ]
      SUBJECT    [ AGREEMENT  ① ] ]


Figure  11.9  shows  the  extended  representations  for  the  arguments  to 
this  unification.  Note  how  the  reentrant  structure  in  the  first  argument  is 
captured  through  the  use  of  the  PTR  field. 

These original arguments are neither identical, nor null, nor atomic, so
the main loop is entered. Looping over the features of f2, the algorithm is
led to a recursive attempt to unify the values of the corresponding SUBJECT
features of f1 and f2.

    [ AGREEMENT  ① ]  ⊔  [ AGREEMENT  [ PERSON  3 ] ]

These arguments are also non-identical, non-null, and non-atomic, so
the loop is entered again, leading to a recursive check of the values of the
AGREEMENT features.

    [ NUMBER  SG ]  ⊔  [ PERSON  3 ]

In looping over the features of the second argument, the fact that the
first argument lacks a PERSON feature is discovered. A PERSON feature
initialized with a NULL value is, therefore, added to the first argument.
This, in effect, changes the previous unification to the following.

    [ NUMBER  SG        ⊔  [ PERSON  3 ]
      PERSON  NULL ]

After creating this new PERSON feature, the next recursive call leads
to the unification of the NULL value of the new feature in the first argument
with the 3 value of the second argument. This recursive call results in the
assignment of the pointer field of the first argument to the 3 value in f2,
as shown in Figure 11.10.

Since there are no further features to check in the f2 argument at any
level of recursion, each in turn sets the pointer for its f2 argument to
point at its f1 argument and returns it. The result of all these assignments
is shown in Figure 11.11.


11.5  Parsing  with  Unification  Constraints 

We now have all the pieces necessary to integrate feature structures and
unification into a parser. Fortunately, the order-independent nature of
unification allows us to largely ignore the actual search strategy used in
the parser. Once we have unification constraints associated with the
context-free rules of the grammar, and feature structures with the states of
the search, any of the standard search algorithms described in Chapter 10 can
be used.

Of  course,  this  leaves  a fairly  large  range  of  possible  implementation 
strategies.  We  could,  for  example,  simply  parse  as  we  did  before  using  the 
context-free  components  of  the  rules,  and  then  build  the  feature  structures 
for  the  resulting  trees  after  the  fact,  filtering  out  those  parses  that  contain 
unification  failures.  Although  such  an  approach  would  result  in  only  well- 
formed  structures  in  the  end,  it  fails  to  use  the  power  of  unification  to  reduce 
the  size  of  the  parser’s  search  space  during  parsing. 

The next section describes an approach that makes better use of the
power of unification by integrating unification constraints directly into the
Earley parsing process, allowing ill-formed structures to be eliminated as
soon as they are proposed. As we will see, this approach requires only
minimal changes to the basic Earley algorithm. We then move on to briefly
consider an approach to unification parsing that moves even further away
from standard context-free methods.

Integrating  Unification  into  an  Earley  Parser 

We  have  two  goals  in  integrating  feature  structures  and  unification  into  the 
Earley  algorithm:  to  use  feature  structures  to  provide  a richer  representation 
for  the  constituents  of  the  parse,  and  to  block  the  entry  into  the  chart  of  ill- 
formed  constituents  that  violate  unification  constraints.  As  we  will  see,  these 
goals  can  be  accomplished  via  fairly  minimal  changes  to  the  original  Earley 
scheme  given  on  page  378. 

The  first  change  involves  the  various  representations  used  in  the  orig- 
inal code.  Recall  that  the  Earley  algorithm  operates  by  using  a set  of  un- 
adorned context-free  grammar  rules  to  fill  in  a data-structure  called  a chart 
with  a set  of  states.  At  the  end  of  the  parse,  the  states  that  make  up  this  chart 
represent  all  possible  parses  of  the  input.  Therefore,  we  begin  our  changes 
by  altering  the  representations  of  both  the  context-free  grammar  rules,  and 
the  states  in  the  chart. 

The  rules  are  altered  so  that  in  addition  to  their  current  components, 
they  also  include  a feature  structure  derived  from  their  unification  constraints. 
More  specifically,  we  will  use  the  constraints  listed  with  a rule  to  build  a fea- 
ture structure,  represented  as  a DAG,  for  use  with  that  rule  during  parsing. 

Consider the following context-free rule with unification constraints.

    S → NP VP
        ⟨NP HEAD AGREEMENT⟩ = ⟨VP HEAD AGREEMENT⟩
        ⟨S HEAD⟩ = ⟨VP HEAD⟩

Converting these constraints into a feature structure results in the following
structure:

    [ S   [ HEAD  ① ]
      NP  [ HEAD  [ AGREEMENT  ② ] ]
      VP  [ HEAD  ① [ AGREEMENT  ② ] ] ]


In  this  derivation,  we  combined  the  various  constraints  into  a single  structure 
by  first  creating  top-level  features  for  each  of  the  parts  of  the  context-free 
rule,  S,  NP,  and  VP  in  this  case.  We  then  add  further  components  to  this 
structure  by  following  the  path  equations  in  the  constraints.  Note  that  this 
is  a purely  notational  conversion;  the  DAGs  and  the  constraint  equations 
contain  the  same  information.  However,  tying  the  constraints  together  in  a 
single  feature  structure  puts  it  in  a form  that  can  be  passed  directly  to  our 
unification  algorithm. 
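
The conversion just described can be sketched in Python for the simple case
of this S rule. The sketch rests on assumptions that are ours rather than the
text's: feature structures are plain nested dictionaries, reentrancy is
modeled by two paths leading to the very same dictionary object, and the
hypothetical helpers node_at and equate handle only path-to-path equations
applied to paths that do not yet carry conflicting values (a full version
would call the unification algorithm instead of raising an error).

```python
def node_at(root, path):
    """Follow path from root, creating empty feature structures as needed."""
    node = root
    for feature in path:
        node = node.setdefault(feature, {})
    return node


def equate(root, left, right):
    """Impose <left> = <right> by making both paths end at one shared node."""
    shared = node_at(root, right)
    parent = node_at(root, left[:-1])
    existing = parent.get(left[-1])
    if existing and existing is not shared:
        raise ValueError("left path already has a value; a full version would unify")
    parent[left[-1]] = shared


rule_dag = {}
for part in ("S", "NP", "VP"):         # one top-level feature per part of the rule
    rule_dag[part] = {}

# Follow the path equations listed with the rule.
equate(rule_dag, ("NP", "HEAD", "AGREEMENT"), ("VP", "HEAD", "AGREEMENT"))
equate(rule_dag, ("S", "HEAD"), ("VP", "HEAD"))
```

After these calls the HEAD of S and the HEAD of VP are one and the same
object, as are the AGREEMENT values reached through NP and VP, which mirrors
the two shared tags in the structure above.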

The second change involves the states used to represent partial parses
in the Earley chart. The original states contain fields for the context-free
rule being used, the position of the dot representing how much of the rule
has been completed, the positions of the beginning and end of the state, and
a list of other states that represent the completed sub-parts of the state.
To this set of fields, we simply add an additional field to contain the DAG
representing the feature structure corresponding to the state. Note that when
a rule is first used by PREDICTOR to create a state, the DAG associated with
the state will simply consist of the DAG retrieved from the rule. For
example, when PREDICTOR uses the above S rule to enter a state into the
chart, the DAG given above will be its initial DAG. We'll denote states like
this as follows, where Dag denotes the feature structure given above.

    S → • NP VP, [0,0], [], Dag
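
As a rough illustration of the extended state, the following Python sketch
lists the original fields together with the added DAG field. The class name
State, its field names, and the helper methods are all hypothetical, and the
DAG is again modeled as a nested dictionary with shared objects standing in
for reentrancy.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    lhs: str                       # left-hand side of the context-free rule
    rhs: tuple                     # right-hand side categories
    dot: int                       # how many right-hand side symbols are completed
    start: int                     # position where the state begins
    end: int                       # position where the state ends
    subparts: list = field(default_factory=list)   # completed sub-states
    dag: dict = field(default_factory=dict)        # feature structure of the state

    def is_complete(self):
        return self.dot == len(self.rhs)

    def next_category(self):
        return None if self.is_complete() else self.rhs[self.dot]

    def __str__(self):
        symbols = list(self.rhs)
        symbols.insert(self.dot, "•")
        return f"{self.lhs} → {' '.join(symbols)}, [{self.start},{self.end}]"


# The DAG of the S rule; shared objects play the role of the shared tags.
shared_head = {"AGREEMENT": {}}
s_rule_dag = {"S": {"HEAD": shared_head},
              "NP": {"HEAD": {"AGREEMENT": shared_head["AGREEMENT"]}},
              "VP": {"HEAD": shared_head}}

# The state entered by the predictor starts out with the rule's own DAG.
predicted = State("S", ("NP", "VP"), dot=0, start=0, end=0, dag=s_rule_dag)
```

Printing predicted with str gives the dotted rule and span in roughly the
notation used above, and predicted.next_category() reports that an NP is
expected next.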

Given these representational additions, we can move on to altering
the algorithm itself. The most important change concerns the actions that
take place when a new state is created via the extension of an existing state,
which takes place in the COMPLETER routine. Recall that COMPLETER is
called when a completed constituent has been added to the chart. Its task
is to attempt to find, and extend, existing states in the chart that are
looking for constituents that are compatible with the newly completed
constituent. COMPLETER is, therefore, a function that creates new states by
combining the information from two other states, and as such is a likely
place to apply the unification operation.

To be more specific, COMPLETER adds a new state into the chart by
finding an existing state whose • can be advanced by the newly completed
state. A • can be advanced when the category of the constituent immediately
following it matches the category of the newly completed constituent. To
accommodate the use of feature structures, we can alter this scheme by
unifying the feature structure associated with the newly completed state with
the appropriate part of the feature structure being advanced. If this
unification succeeds, then the DAG of the new state receives the unified
structure and is entered into the chart; if it fails, then no new state is
entered into the chart. The appropriate alterations to COMPLETER are shown in
Figure 11.12.

Consider this process in the context of parsing the phrase That flight,
where the That has already been seen, as is captured by the following state.

    NP → Det • Nominal, [0,1], [S_Det], Dag1

    Dag1:
    [ NP       [ HEAD  ① ]
      DET      [ HEAD  [ AGREEMENT  ② [ NUMBER  SG ] ] ]
      NOMINAL  [ HEAD  ① [ AGREEMENT  ② ] ] ]

Now consider the later situation where the parser has processed flight and
has subsequently produced the following state.

    Nominal → Noun •, [1,2], [S_Noun], Dag2

    Dag2:
    [ NOMINAL  [ HEAD  ① ]
      NOUN     [ HEAD  ① [ AGREEMENT  [ NUMBER  SG ] ] ] ]


To advance the NP rule, the parser unifies the feature structure found under
the NOMINAL feature of Dag2 with the feature structure found under the
NOMINAL feature of the NP's Dag1. As in the original algorithm, a new state
is created to represent the fact that an existing state has been advanced.
This new state's DAG is given the DAG that resulted from the above
unification.


function EARLEY-PARSE(words, grammar) returns chart

  ENQUEUE((γ → • S, [0,0], dag_γ), chart[0])
  for i ← from 0 to LENGTH(words) do
    for each state in chart[i] do
      if INCOMPLETE?(state) and
           NEXT-CAT(state) is not a part of speech then
        PREDICTOR(state)
      elseif INCOMPLETE?(state) and
           NEXT-CAT(state) is a part of speech then
        SCANNER(state)
      else
        COMPLETER(state)
    end
  end
  return(chart)

procedure PREDICTOR((A → α • B β, [i,j], dag_A))
  for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
    ENQUEUE((B → • γ, [j,j], dag_B), chart[j])
  end

procedure SCANNER((A → α • B β, [i,j], dag_A))
  if B ⊂ PARTS-OF-SPEECH(word[j]) then
    ENQUEUE((B → word[j], [j,j+1], dag_B), chart[j+1])

procedure COMPLETER((B → γ •, [j,k], dag_B))
  for each (A → α • B β, [i,j], dag_A) in chart[j] do
    if new-dag ← UNIFY-STATES(dag_B, dag_A, B) ≠ Fails!
      ENQUEUE((A → α B • β, [i,k], new-dag), chart[k])
  end

procedure UNIFY-STATES(dag1, dag2, cat)
  dag1-cp ← COPYDAG(dag1)
  dag2-cp ← COPYDAG(dag2)
  UNIFY(FOLLOW-PATH(cat, dag1-cp), FOLLOW-PATH(cat, dag2-cp))

procedure ENQUEUE(state, chart-entry)
  if state is not subsumed by a state in chart-entry then
    PUSH(state, chart-entry)
  end


Figure 11.12 Modifications to the Earley algorithm to include unification.

The final change to the original algorithm concerns the check for states
already contained in the chart. In the original algorithm, the ENQUEUE
function refused to enter into the chart any state that was identical to one
already present in the chart, where identical meant the same rule, with the
same start and finish positions, and the same position of the •. It is this
check that allows the algorithm to, among other things, avoid the infinite
recursion problems associated with left-recursive rules.

The problem, of course, is that our states are now more complex since
they have complex feature structures associated with them. States that
appeared identical under the original criteria might in fact now be different
since their associated DAGs may differ. The obvious solution to this problem
is to simply extend the identity check to include the DAGs associated
with the states, but it turns out that we can improve on this solution.

The motivation for the improvement lies in the motivation for the identity
check. Its purpose is to prevent the wasteful addition of a state into the
chart whose effect on the parse would be accomplished by an already existing
state. Put another way, we want to prevent the entry into the chart of
any state that would duplicate the work that will eventually be done by other
states. Of course, this will clearly be the case with identical states, but it
turns out it is also the case for states in the chart that are more general
than new states being considered.

Consider the situation where the chart contains the following state,
where the Dag places no constraints on the Det.

    NP → • Det NP, [i,i], [], Dag

Such  a state  simply  says  that  it  is  expecting  a Det  at  position  i,  and  that  any 
Det  will  do. 

Now consider the situation where the parser wants to insert a new state
into the chart that is identical to this one, with the exception that its DAG
restricts the Det to be singular. In this case, although the states in question
are not identical, the addition of the new state to the chart would accomplish
nothing and should therefore be prevented.

To see this, let's consider all the cases. If the new state is added, then a
subsequent singular Det will match both rules and advance both. Due to the
unification of features, both will have DAGs indicating that their Dets are
singular, with the net result being duplicate states in the chart. If a
plural Det is encountered instead, the new state will reject it and not
advance, while the old rule will advance, entering a single new state into
the chart. If, on the other hand, the new state is not placed in the chart, a
subsequent plural or singular Det will match the more general state and
advance it, leading to the addition of one new state into the chart. Note
that this leaves us in exactly the same situation as if the new state had
been entered into the chart, with the exception that the duplication is
avoided. In sum, nothing worthwhile is accomplished by entering into the
chart a state that is more specific than a state already in the chart.

Fortunately, the notion of subsumption described earlier gives us a formal
way to talk about the generalization and specialization relations among
feature structures. This suggests that the proper way to alter ENQUEUE is to
check if a newly created state is subsumed by any existing states in the
chart. If it is, then it will not be allowed into the chart. More
specifically, a new state that is identical in terms of its rule, start and
finish positions, subparts, and • position to an existing state will not be
entered into the chart if its DAG is subsumed by the DAG of an existing state
(i.e., if Dag_old ⊑ Dag_new). The necessary change to the original Earley
ENQUEUE procedure is shown in Figure 11.12.
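
A simplified version of this check might look as follows in Python. It is a
sketch under stated assumptions: DAGs are plain nested dictionaries, the
subsumption test ignores reentrancy (a complete test would also compare which
paths are shared), a state is reduced to a pair of a key and a DAG, and the
names subsumes and enqueue are our own.

```python
def subsumes(general, specific):
    """True if every constraint in general also holds in specific.

    Simplification: shared (reentrant) substructure is not compared.
    """
    if isinstance(general, dict):
        if not general:                 # the empty structure subsumes everything
            return True
        if not isinstance(specific, dict):
            return False
        return all(feat in specific and subsumes(val, specific[feat])
                   for feat, val in general.items())
    return general == specific


def enqueue(state, chart_entry):
    """Add state unless an existing state with the same key subsumes its DAG."""
    key, dag = state
    for old_key, old_dag in chart_entry:
        if old_key == key and subsumes(old_dag, dag):
            return False
    chart_entry.append(state)
    return True


# The key bundles the dotted rule, start, end, and subparts of a state.
key = ("NP → • Det NP", 3, 3, ())
any_det = (key, {"DET": {}})
singular_det = (key, {"DET": {"HEAD": {"AGREEMENT": {"NUMBER": "SG"}}}})

chart_entry = []
first = enqueue(any_det, chart_entry)
second = enqueue(singular_det, chart_entry)
```

The general state is accepted and the more specific one is then refused,
leaving a single state in chart_entry. Had the two arrived in the opposite
order, both would have been entered, since the singular state does not
subsume the general one.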

The  Need  for  Copying 

The calls to COPYDAG within the UNIFY-STATES procedure require some
elaboration. Recall that one of the strengths of the Earley algorithm (and of
the dynamic programming approach in general) is that once states have been
entered into the chart they may be used again and again as part of different
derivations, including ones that in the end do not lead to successful parses.
This ability is the motivation for the fact that states already in the chart
are not updated to reflect the progress of their •, but instead are copied
and then updated, leaving the original states intact so that they can be used
again in further derivations.

The call to COPYDAG in UNIFY-STATES is required to preserve this behavior
because of the destructive nature of our unification algorithm. If we
simply unified the DAGs associated with the existing states, those states
would be altered by the unification, and hence would not be available in the
same form for subsequent uses by the COMPLETER function. Note that this has
negative consequences regardless of whether the unification succeeds or
fails; in either case the original states are altered.

Let's consider what would happen if the call to COPYDAG were absent
in the following example where an early unification attempt fails.

(11.24)  Show me morning flights.

Let's assume that our parser has the following entry for the ditransitive
version of the verb show, as well as the following transitive and ditransitive
verb phrase rules.

    Verb → show
        ⟨Verb HEAD SUBCAT FIRST CAT⟩ = NP
        ⟨Verb HEAD SUBCAT SECOND CAT⟩ = NP
        ⟨Verb HEAD SUBCAT THIRD⟩ = END

    VP → Verb NP
        ⟨VP HEAD⟩ = ⟨Verb HEAD⟩
        ⟨VP HEAD SUBCAT FIRST CAT⟩ = ⟨NP CAT⟩
        ⟨VP HEAD SUBCAT SECOND⟩ = END

    VP → Verb NP NP
        ⟨VP HEAD⟩ = ⟨Verb HEAD⟩
        ⟨VP HEAD SUBCAT FIRST CAT⟩ = ⟨NP1 CAT⟩
        ⟨VP HEAD SUBCAT SECOND CAT⟩ = ⟨NP2 CAT⟩
        ⟨VP HEAD SUBCAT THIRD⟩ = END

When the word me is read, the state representing the transitive verb phrase
will be completed since its dot has moved to the end. COMPLETER will,
therefore, call UNIFY-STATES before attempting to enter this complete state
into the chart. This will fail since the SUBCAT structures of these two rules
cannot be unified. This is, of course, exactly what we want since this version
of show is ditransitive. Unfortunately, because of the destructive nature of
our unification algorithm we have already altered the DAG attached to the
state representing show, as well as the one attached to the VP, thereby
ruining them for use with the correct verb phrase rule later on. Thus, to
make sure that states can be used again and again with multiple derivations,
copies are made of the DAGs associated with states before attempting any
unifications involving them.
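
The effect of copying can be seen in a much smaller stand-in for the show
example. The Python sketch below is illustrative only: it uses a simplified
destructive merge over plain nested dictionaries (the hypothetical
unify_in_place) in place of the pointer-based algorithm of Figure 11.8, lets
the standard library's copy.deepcopy play the role of COPYDAG, and uses
invented agreement features chosen so that the failure occurs only after one
feature has already been merged.

```python
import copy


def unify_in_place(f1, f2):
    """Destructively merge f2 into f1 (plain nested dicts); False on a clash."""
    for feat, val in f2.items():
        if feat not in f1:
            f1[feat] = val
        elif isinstance(f1[feat], dict) and isinstance(val, dict):
            if not unify_in_place(f1[feat], val):
                return False
        elif f1[feat] != val:
            return False
    return True


def unify_states(dag1, dag2, cat):
    """Unify the cat parts of copies of the two DAGs; the originals stay intact."""
    dag1_cp, dag2_cp = copy.deepcopy(dag1), copy.deepcopy(dag2)
    if unify_in_place(dag2_cp[cat], dag1_cp[cat]):
        return dag2_cp
    return None


waiting = {"NOMINAL": {"HEAD": {"AGREEMENT": {"NUMBER": "SG"}}}}
completed = {"NOMINAL": {"HEAD": {"AGREEMENT": {"PERSON": 3, "NUMBER": "PL"}}}}
before = copy.deepcopy(waiting)

# With copying: the unification fails on NUMBER and nothing is disturbed.
safe = unify_states(completed, waiting, "NOMINAL")
untouched = (waiting == before)

# Without copying: the same failure leaves a stray PERSON feature behind.
unsafe = unify_in_place(waiting["NOMINAL"], completed["NOMINAL"])
damaged = (waiting != before)
```

Both attempts fail, but only the second alters the waiting state's DAG.
Because deepcopy copies each shared object once, reentrant structure inside a
single DAG survives the copy, which is what a COPYDAG routine needs to
guarantee.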

We should note that all of this copying can be quite expensive. As a
result, a number of alternative techniques have been developed that attempt
to minimize this cost (Pereira, 1985; Karttunen and Kay, 1985; Tomabechi,
1991; Kogure, 1990). Kiefer et al. (1999) describe a set of related techniques
used to speed up a large unification-based parsing system.


Unification Parsing

A more radical approach to using unification in parsing can be motivated
by looking at an alternative way of denoting our augmented grammar rules.
Consider the following S rule that we have been using throughout this
chapter.

    S → NP VP
        ⟨NP HEAD AGREEMENT⟩ = ⟨VP HEAD AGREEMENT⟩
        ⟨S HEAD⟩ = ⟨VP HEAD⟩

An interesting way to alter the context-free part of this rule is to change
the way its grammatical categories are specified. In particular, we can
place the categorical information about the parts of the rule inside the
feature structure, rather than inside the context-free part of the rule. A
typical instantiation of this approach would give us the following rule
(Shieber, 1986).

    X0 → X1 X2
        ⟨X0 CAT⟩ = S
        ⟨X1 CAT⟩ = NP
        ⟨X2 CAT⟩ = VP
        ⟨X1 HEAD AGREEMENT⟩ = ⟨X2 HEAD AGREEMENT⟩
        ⟨X0 HEAD⟩ = ⟨X2 HEAD⟩

Focusing solely on the context-free component of the rule, this rule
now simply states that the X0 constituent consists of two components, and
that the X1 constituent is immediately to the left of the X2 constituent.
The information about the actual categories of these components is placed
inside the rule's feature structure; in this case, indicating that X0 is an S,
X1 is an NP, and X2 is a VP. Altering the Earley algorithm to deal with this
notational change is trivial. Instead of seeking the categories of constituents
in the context-free components of the rule, it simply needs to look at the CAT
feature in the DAG associated with a rule.

Of  course,  since  it  is  the  case  that  these  two  rules  contain  precisely  the 
same  information,  it  isn’t  clear  that  there  is  any  benefit  to  this  change.  To 
see  the  potential  benefit  of  this  change,  consider  the  following  rules. 

    X0 → X1 X2
        ⟨X0 CAT⟩ = ⟨X1 CAT⟩
        ⟨X2 CAT⟩ = PP

    X0 → X1 and X2
        ⟨X1 CAT⟩ = ⟨X2 CAT⟩
        ⟨X0 CAT⟩ = ⟨X1 CAT⟩

The first rule is an attempt to generalize over various rules that we have
already seen, such as NP → NP PP and VP → VP PP. It simply states that
any category can be followed by a prepositional phrase, and that the resulting
constituent has the same category as the original. Similarly, the second rule
is an attempt to generalize over rules such as S → S and S, NP → NP and NP,
and so on.¹ It states that any constituent can be conjoined with a constituent
of the same category to yield a new constituent of the same category. What
these rules have in common is their use of context-free rules that contain
constituents with constrained, but unspecified, categories, something that
cannot be accomplished with our old rule format.
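
The way these two rules constrain categories without naming them can be
illustrated with a small Python sketch. It is deliberately simplified and
rests on our own assumptions: each constituent is a dictionary with a CAT
feature, the hypothetical functions attach_pp and conjoin hard-code the
category equations of the two rules, and a constituent with no CAT value is
simply rejected, whereas a unification-based parser would obtain the same
effect by unifying the constituents' DAGs with the rule's DAG.

```python
def attach_pp(x1, x2):
    """X0 -> X1 X2 with <X0 CAT> = <X1 CAT> and <X2 CAT> = PP."""
    cat = x1.get("CAT")
    if cat is None or x2.get("CAT") != "PP":
        return None
    return {"CAT": cat}                # X0 takes whatever category X1 has


def conjoin(x1, x2):
    """X0 -> X1 and X2 with <X1 CAT> = <X2 CAT> and <X0 CAT> = <X1 CAT>."""
    cat = x1.get("CAT")
    if cat is None or x2.get("CAT") != cat:
        return None
    return {"CAT": cat}                # X0 shares the common category


np_pp = attach_pp({"CAT": "NP"}, {"CAT": "PP"})
vp_pp = attach_pp({"CAT": "VP"}, {"CAT": "PP"})
s_and_s = conjoin({"CAT": "S"}, {"CAT": "S"})
np_and_vp = conjoin({"CAT": "NP"}, {"CAT": "VP"})
```

A single function covers both NP → NP PP and VP → VP PP, and the mismatched
conjunction of an NP with a VP is rejected, without either function ever
mentioning NP, VP, or S.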

Of course, since these rules rely on the use of the CAT feature, their
effect could be approximated in the old format by simply enumerating all the
various instantiations of the rule. A more compelling case for the new
approach is motivated by the existence of grammatical rules, or constructions,
that contain constituents that are not easily characterized using any existing
syntactic category.

Consider the following examples of the English How-Many construction
from the WSJ (Jurafsky, 1992).

(11.25) How early does it open?

(11.26) How deep is her Greenness?

(11.27) How papery are your profits?

(11.28)  How  quickly  we  forget. 

(11.29)  How  many  of  you  can  name  three  famous  sporting  Blanchards? 

As is illustrated in these examples, the How-Many construction has two
components: the lexical item how, and a lexical item or phrase that is rather
hard to characterize syntactically. It is this second element that is of interest
to us here. As these examples show, it can be an adjective, adverb, or some
kind of quantified phrase (although not all members of these categories yield
grammatical results). Clearly, a better way to describe this second element
is as a scalar concept, a constraint that can be captured using feature
structures, as in the following rule.

1 These  rules  should  not  be  mistaken  for  correct,  or  complete,  accounts  of  the  phenomena 
in  question. 
X0 → X1 X2
⟨X1 ORTH⟩ = ⟨how⟩
⟨X2 SEM⟩ = ⟨SCALAR⟩

A complete  account  of  rules  like  this  involves  semantics  and  will  therefore 
have  to  wait  for  Chapter  14.  The  key  point  here  is  that  by  using  feature 
structures  a grammatical  rule  can  place  constraints  on  its  constituents  in  a 
manner  that  does  not  make  any  use  of  the  notion  of  a syntactic  category. 

Of course, dealing with this kind of rule requires some changes to our
parsing scheme. All of the parsing approaches we have considered thus far are
driven by the syntactic category of the various constituents in the input. More
specifically, they are based on simple atomic matches between the categories
that have been predicted, and categories that have been found. Consider, for
example, the operation of the COMPLETER function shown in Figure 11.12.
This function searches the chart for states that can be advanced by a newly
completed state. It accomplishes this by matching the category of the newly
completed state against the category of the constituent following the • in the
existing state. Clearly this approach will run into trouble when there are no
such categories to consult.

The remedy for this problem with COMPLETER is to search the chart
for states whose DAGs unify with the DAG of the newly completed state.
This eliminates any requirement that states or rules have a category. The
PREDICTOR can be changed in a similar fashion by having it add to the
chart those states whose X0 DAG component can unify with the constituent
following the • of the predicting state. Exercise 11.6 asks you to make the
necessary changes to the pseudo-code in Figure 11.12 to effect this style of
parsing. Exercise 11.7 asks you to consider some of the implications of these
alterations, particularly with respect to prediction.
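
The category-free matching described above can be sketched roughly as follows. This is only an illustration, not the pseudo-code of Figure 11.12: it assumes feature structures are nested Python dicts with atomic string values and no reentrancy, and the names `unify` and `can_advance`, as well as the SEM value EVENT, are hypothetical.

```python
def unify(f, g):
    """Unify two feature structures; return the merged structure, or None on failure."""
    # The empty structure [] carries no information and unifies with anything.
    if f == {}:
        return g
    if g == {}:
        return f
    if isinstance(f, dict) and isinstance(g, dict):
        out = dict(f)
        for feat, gval in g.items():
            if feat in out:
                merged = unify(out[feat], gval)
                if merged is None:
                    return None  # incompatible values for the same feature
                out[feat] = merged
            else:
                out[feat] = gval
        return out
    # Atomic values unify only if they are identical.
    return f if f == g else None


def can_advance(expected, completed):
    """COMPLETER-style test: the constituent expected after the dot
    must unify with the newly completed constituent."""
    return unify(expected, completed) is not None


# The X2 constituent of the How-Many rule constrains only SEM, never CAT.
expected_x2 = {"SEM": "SCALAR"}
early = {"CAT": "Adv", "ORTH": "early", "SEM": "SCALAR"}
open_ = {"CAT": "V", "ORTH": "open", "SEM": "EVENT"}  # made-up SEM value

print(can_advance(expected_x2, early))
print(can_advance(expected_x2, open_))
```

Because the test never consults CAT, an adverb, an adjective, or a quantified phrase can all advance the state as long as their SEM value is compatible.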

11.6  Types  and  Inheritance 


I am surprised that ancient and modern writers have not attributed
greater importance to the laws of inheritance...

(de  Tocqueville,  1966) 

The basic feature structures we have presented so far have two
problems that have led to extensions to the formalism. The first problem is that
there is no way to place a constraint on what can be the value of a feature.
For example, we have implicitly assumed that the NUMBER attribute can take
only SG and PL as values. But in our current system, there is nothing, for
example, to stop NUMBER from having 3rd or FEMININE as values:

[NUMBER  FEMININE]

This problem has caused many unification-based grammatical theories
to add various mechanisms to try to constrain the possible values of a
feature. Formalisms like Functional Unification Grammar (FUG) (Kay, 1979,
1984, 1985) and Lexical Functional Grammar (LFG) (Bresnan, 1982), for
example, focused on ways to keep an intransitive verb like sneeze from
unifying with a direct object (Marie sneezed Pauline). This was addressed in
FUG by adding a special atom none which is not allowed to unify with
anything, and in LFG by adding coherence conditions which specified when a
feature should not be filled. Generalized Phrase Structure Grammar (GPSG)
(Gazdar et al., 1985, 1988) added a class of feature co-occurrence
restrictions, to prevent, for example, nouns from having some verbal properties.

The second problem with simple feature structures is that there is no
way to capture generalizations across them. For example, the many types of
English verb phrases described in the Subcategorization section on page 407
share many features, as do the many kinds of subcategorization frames for
verbs. Syntacticians were looking for ways to express these generalities.

A general solution to both of these problems is the use of types. Type
systems for unification grammars have the following characteristics:



1. Each feature structure is labeled by a type.

2. Conversely, each type has appropriateness conditions expressing which
features are appropriate for it.

3. The types are organized into a type hierarchy, in which more specific
types inherit properties of more abstract ones.

4. The unification operation is modified to unify the types of feature
structures in addition to unifying the attributes and values.

In such typed feature structure systems, types are a new class of
objects, just like attributes and values were for standard feature structures.
Types come in two kinds: simple types (also called atomic types), and
complex types. Let's begin with simple types. A simple type is an atomic
symbol like sg or pl (we will use boldface for all types), and replaces the simple
atomic values used in standard feature structures. All types are organized
into a multiple-inheritance type hierarchy (a partial order or lattice).
Figure 11.13 shows the type hierarchy for the new type agr, which will be the
type of the kind of atomic object that can be the value of an AGREE feature.


agr


1st    3rd    sg    pl


3sg-masc    3sg-fem    3sg-neut


Figure  11.13  A simple  type  hierarchy  for  the  subtypes  of  type  agr  which 
can  be  the  value  of  the  AGREE  attribute.  After  Carpenter  (1992). 


In the hierarchy in Figure 11.13, 3rd is a subtype of agr, and 3-sg is
a subtype of both 3rd and sg. Types can be unified in the type hierarchy;
the unification of any two types is the most general type that is more specific
than the two input types. Thus:

3rd ⊔ sg = 3-sg

1st ⊔ pl = 1-pl

1st ⊔ agr = 1st

3rd ⊔ 1st = undefined

The unification of two types which do not have a defined unifier is
undefined, although it is also possible to explicitly represent this fail type
using the symbol ⊥ (Aït-Kaci, 1984).
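
A minimal sketch of type unification over such a hierarchy is given below. The encoding is an assumption made for illustration: each type is mapped to its immediate supertypes, and the intermediate types 1-sg, 3-sg, 1-pl, and 3-pl (of which only some are named in the text) are filled in by hand. The function name `unify_types` is hypothetical, and `None` stands in for the fail type.

```python
# Hypothetical encoding of the agr hierarchy: type -> immediate supertypes.
PARENTS = {
    "agr": [],
    "1st": ["agr"], "3rd": ["agr"], "sg": ["agr"], "pl": ["agr"],
    "1-sg": ["1st", "sg"], "3-sg": ["3rd", "sg"],
    "1-pl": ["1st", "pl"], "3-pl": ["3rd", "pl"],
    "3sg-masc": ["3-sg"], "3sg-fem": ["3-sg"], "3sg-neut": ["3-sg"],
}


def supertypes(t):
    """Return t together with every type above it in the hierarchy."""
    seen = {t}
    stack = [t]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unify_types(a, b):
    """Most general type that is at least as specific as both a and b,
    or None when no such type exists."""
    # Every type that lies at or below both a and b.
    common = [t for t in PARENTS if a in supertypes(t) and b in supertypes(t)]
    # The unifier is the candidate that sits above all the other candidates.
    for t in common:
        if all(t in supertypes(u) for u in common):
            return t
    return None


print(unify_types("3rd", "sg"))
print(unify_types("1st", "agr"))
print(unify_types("3rd", "1st"))
```

Scanning every type is quadratic in the size of the hierarchy, which is fine for a dozen types; a real system would precompute the lattice.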

The second kind of type is the complex type, which specifies:

• A set of features that are appropriate for that type

• Restrictions on the values of those features (expressed in terms of types)

• Equality constraints between the values

Consider a simplified representation of the complex type verb, which
just represents agreement and verb morphological form information. A
definition of verb would define the two appropriate features, AGREE and VFORM,
and would also define the type of the values of the two features. Let's
suppose that the AGREE feature takes values of type agr defined in Figure 11.13
above, and the VFORM feature takes values of type vform (where vform
subsumes the seven subtypes finite, infinitive, gerund, base, present-participle,
past-participle, and passive-participle). Thus verb would be defined as
follows (where the convention is to indicate the type either at the top of the
AVM or just to the lower left of the left bracket):

[ verb
  AGREE  agr
  VFORM  vform ]

By  contrast,  the  type  noun  might  be  defined  with  the  AGREE  feature, 
but  without  the  VFORM  feature: 

[ noun
  AGREE  agr ]

The unification operation is augmented for typed feature structures just
by requiring that the types of the two structures unify, in addition to the
values of the component features unifying.


[ verb           ]     [ verb           ]     [ verb           ]
[ AGREE  1st     ]  ⊔  [ AGREE  sg      ]  =  [ AGREE  1-sg    ]
[ VFORM  gerund  ]     [ VFORM  gerund  ]     [ VFORM  gerund  ]
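
The following sketch shows one way the augmented operation could look, under several assumptions: a typed feature structure is a Python dict with a "TYPE" key, simple-typed values are plain type names, the hierarchy fragment and the `APPROPRIATE` table are written by hand for this example, and `None` signals failure. None of these names come from a particular system.

```python
# Hand-written fragment of the hierarchy: type -> itself plus all supertypes.
ANCESTORS = {
    "agr": {"agr"},
    "1st": {"1st", "agr"},
    "sg": {"sg", "agr"},
    "1-sg": {"1-sg", "1st", "sg", "agr"},
    "vform": {"vform"},
    "gerund": {"gerund", "vform"},
    "finite": {"finite", "vform"},
    "verb": {"verb"},
    "noun": {"noun"},
}

# Appropriateness conditions: the features a complex type may carry and the
# type that each feature's value must fall under.
APPROPRIATE = {
    "verb": {"AGREE": "agr", "VFORM": "vform"},
    "noun": {"AGREE": "agr"},
}


def unify_types(a, b):
    """Most general common subtype of a and b, or None."""
    common = [t for t, anc in ANCESTORS.items() if a in anc and b in anc]
    for t in common:
        if all(t in ANCESTORS[u] for u in common):
            return t
    return None


def unify(f, g):
    """Unify two typed feature structures (or two simple types)."""
    if isinstance(f, str) and isinstance(g, str):
        return unify_types(f, g)
    if isinstance(f, str) or isinstance(g, str):
        return None
    t = unify_types(f["TYPE"], g["TYPE"])
    if t is None:
        return None  # the types themselves must unify first
    allowed = APPROPRIATE.get(t, {})
    result = {"TYPE": t}
    for feat in (set(f) | set(g)) - {"TYPE"}:
        if feat not in allowed:
            return None  # feature is not appropriate for this type
        if feat in f and feat in g:
            value = unify(f[feat], g[feat])
        else:
            value = f.get(feat, g.get(feat))
        if value is None:
            return None
        value_type = value if isinstance(value, str) else value["TYPE"]
        if unify_types(value_type, allowed[feat]) is None:
            return None  # value falls outside the feature's declared type
        result[feat] = value
    return result


a = {"TYPE": "verb", "AGREE": "1st", "VFORM": "gerund"}
b = {"TYPE": "verb", "AGREE": "sg", "VFORM": "gerund"}
print(unify(a, b))
```

With this table, a noun carrying a VFORM feature, or a verb whose AGREE value is gerund, is rejected outright, which is exactly the kind of constraint that untyped feature structures could not express.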

Complex types are also part of the type hierarchy. Subtypes of complex
types inherit all the features of their parents, together with the constraints
on their values. Sanfilippo (1993), for example, uses the type hierarchy to
encode the hierarchical structure of the lexicon. Figure 11.14 shows a small
part of this hierarchy, the part that models the various subcategories of verbs
which take sentential complements; these are divided into the transitive ones,
which take direct objects (ask yourself whether you have become better
informed), and the intransitive ones (Monsieur asked whether I wanted to
ride). The type trans-comp-cat would introduce the required direct object,
constraining it to be of type noun-phrase, while types like sbase-comp-cat
would introduce the baseform (bare stem) complement and constrain its
VFORM to be the baseform.

Extensions  to  Typing 

Typed feature structures can be extended by allowing inheritance with
defaults. Default systems have been mainly used in lexical type hierarchies
of the sort described in the previous section, in order to encode
generalizations and subregular exceptions to them. In early versions of default
unification the operation was order-dependent, based on the priority union
operation (Kaplan, 1987). More recent architectures, such as Lascarides and
Copestake (1997) default unification for typed feature structures, are
order-independent, drawing on Young and Rounds (1993) and related to Reiter's
default logic (Reiter, 1980).

[Figure: type hierarchy rooted at comp-cat; the subtypes recoverable from
the figure include tr-sfin-comp-cat, tr-sbase-comp-cat, tr-swh-comp-cat,
intr-sfin-comp-cat, intr-sbase-comp-cat, and intr-sinf-comp-cat.]

Figure 11.14  Part of the type hierarchy for the verb type verb-cat, showing
the subtypes of the comp-cat type. These are all subcategories of verbs which
take sentential complements. After Sanfilippo (1993).

Many unification-based theories of grammar, including HPSG (Pollard
and Sag, 1987, 1994) and LFG (Bresnan, 1982), use an additional mechanism
besides inheritance for capturing lexical generalizations: the lexical rule.
Lexical rules express lexical generalizations by allowing a reduced, and hence
less redundant, lexicon to be automatically expanded by the rules. Lexical
rules were originally proposed by Jackendoff (1975); see Pollard and Sag
(1994) for examples of modern lexical rules, Carpenter (1991) for a discussion
of complexity issues, and Meurers and Minnen (1997) for a recent efficient
implementation. Some authors have proposed using the type hierarchy to
replace lexical rules, either by adding abstract types and some disjunctions
(Krieger and Nerbonne, 1993) or via type underspecification and dynamic
typing, in which underspecified types are combined to make new types on-line
(Koenig and Jurafsky, 1995).

Types can also be used to represent constituency. Rules like (11.13)
on page 407 used a normal phrase structure rule template and added the
features via path equations. Instead, it's possible to represent the whole phrase
structure rule as a type. In order to do this, we need a way to represent
constituents as features. One way to do this, following Sag and Wasow (1999), is
to take a type phrase which has a feature called DTRS ('daughters'), whose
value is a list of phrases. For example, the phrase I love New York could have
the following representation (showing only the DTRS feature):

[ phrase
  DTRS ⟨ [ CAT   PRO
           ORTH  I ],
         [ CAT   VP
           DTRS ⟨ [ CAT   V
                    ORTH  LOVE ],
                  [ CAT   NP
                    ORTH  NEW YORK ] ⟩ ] ⟩ ]
Other Extensions to Unification

There are many other extensions to unification besides typing, including
path inequations (Moshier, 1988; Carpenter, 1992; Carpenter and Penn,
1994), negation (Johnson, 1988, 1990), set-valued features (Pollard and
Moshier, 1990), and disjunction (Kay, 1979; Kasper and Rounds, 1986).
In some unification systems these operations are incorporated into feature
structures. Kasper and Rounds (1986) and others, by contrast, implement
them in a separate metalanguage which is used to describe feature structures.
This idea derives from the work of Pereira and Shieber (1984), and even
earlier work by Kaplan and Bresnan (1982), all of whom distinguished between
a metalanguage for describing feature structures and the actual feature
structures themselves. The descriptions may thus use negation and disjunction to
describe a set of feature structures (i.e., a certain feature must not contain a
certain value, or may contain any of a set of values), but an actual instance
of a feature structure that meets the description would not have negated or
disjunctive values.
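
The separation between descriptions and the structures they describe can be illustrated with a deliberately naive sketch. It assumes flat feature structures stored as Python dicts and treats a description as a predicate over them; the combinator names `has`, `neg`, `either`, and `both` are invented for this example and do not reproduce the logic of Kasper and Rounds (1986).

```python
def has(feature, value):
    """Description: the feature is present with exactly this atomic value."""
    return lambda fs: fs.get(feature) == value


def neg(desc):
    """Description satisfied by every structure that fails desc."""
    return lambda fs: not desc(fs)


def either(*descs):
    """Disjunction: at least one of the descriptions holds."""
    return lambda fs: any(d(fs) for d in descs)


def both(*descs):
    """Conjunction: all of the descriptions hold."""
    return lambda fs: all(d(fs) for d in descs)


# "NUMBER must not be SG, and PERSON may be 1 or 3."
description = both(neg(has("NUMBER", "SG")),
                   either(has("PERSON", "1"), has("PERSON", "3")))

# A concrete feature structure contains only ordinary, positive values.
candidate = {"NUMBER": "PL", "PERSON": "3"}
print(description(candidate))
```

Note that under this simple reading a structure with no NUMBER feature at all also satisfies the negated description; a more careful treatment has to decide how negation interacts with underspecification.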


Summary 

This  chapter  introduced  feature  structures  and  the  unification  operation  which 
is  used  to  combine  them. 

• A feature structure is a set of feature-value pairs, where features are
unanalyzable atomic symbols drawn from some finite set, and values
are either atomic symbols or feature structures. They are represented
either as attribute-value matrices (AVMs) or as directed acyclic graphs
(DAGs), where features are directed labeled edges and feature values
are nodes in the graph.

• Unification is the operation for both combining information (merging
the information content of two feature structures) and comparing
information (rejecting the merger of incompatible features).

• A phrase-structure rule can be augmented with feature structures, and
with feature constraints expressing relations among the feature
structures of the constituents of the rule. Subcategorization constraints can
be represented as feature structures on head verbs (or other predicates).
The elements which are subcategorized for by a verb may appear in the
verb phrase or may be realized apart from the verb, as a long-distance
dependency.

• Feature structures can be typed. The resulting typed feature
structures place constraints on which type of values a given feature can
take, and can also be organized into a type hierarchy to capture
generalizations across types.


Bibliographical  and  Historical  Notes 

The use of features in linguistic theory comes originally from phonology.
Anderson (1985) credits Jakobson (1939) with being the first to use features
(called distinctive features) as an ontological type in a theory, drawing on
previous uses of features by Trubetskoi (1939) and others. The semantic use
of features followed soon after; see Chapter 16 for the history of
componential analysis in semantics. Features in syntax were well established by the
1950s and were popularized by Chomsky (1965).

The unification operation in linguistics was developed independently
by Kay (1979) (feature structure unification) and Colmerauer (1970, 1975)
(term unification). Both were working in machine translation and looking
for a formalism for combining linguistic information which would be
reversible. Colmerauer's original Q-system was a bottom-up parser based on
a series of rewrite rules which contained logical variables, designed for an
English to French machine translation system. The rewrite rules were
reversible to allow them to work for both parsing and generation. Colmerauer,
Fernand Didier, Robert Pasero, Philippe Roussel, and Jean Trudel designed
the Prolog language by extending Q-systems to full unification based
on the resolution principle of Robinson (1965), and implemented a French
analyzer based on it (Colmerauer and Roussel, 1996). The modern use of
Prolog and term unification for natural language via Definite Clause
Grammars was based on Colmerauer's (1975) metamorphosis grammars, and was
developed and named by Pereira and Warren (1980). Meanwhile Martin Kay
and Ron Kaplan had been working with ATN grammars. In an ATN analysis
of a passive, the first NP would be assigned to the subject register, then when
the passive verb was encountered, the value would be moved into the object
register. In order to make this process reversible, they restricted assignments
to registers so that certain registers could only be filled once, i.e., couldn't be
overwritten once written. They thus moved toward the concepts of logical
variables without realizing it. Kay's original unification algorithm was
designed for feature structures rather than terms (Kay, 1979). The integration
of unification into an Earley-style approach given in Section 11.5 is based
on Shieber (1985b).

See  Shieber  (1986)  for  a clear  introduction  to  unification,  and  Knight 
(1989)  for  a multidisciplinary  survey  of  unification. 

Inheritance and appropriateness conditions were first proposed for
linguistic knowledge by Bobrow and Webber (1980) in the context of an
extension of the KL-ONE knowledge representation system (Brachman and
Schmolze, 1985b). Simple inheritance without appropriateness conditions
was taken up by a number of researchers; early users include Jacobs (1985,
1987) and Flickinger et al. (1985). Aït-Kaci (1984) borrowed the
notion of inheritance in unification from the logic programming community.
Typing of feature structures, including both inheritance and appropriateness
conditions, was independently proposed by Calder (1987), Pollard and Sag
(1987), and Elhadad (1990). Typed feature structures were formalized by
King (1989) and Carpenter (1992). There is an extensive literature on the
use of type hierarchies in linguistics, particularly for capturing lexical
generalizations; besides the papers previously discussed, the interested reader
should consult Evans and Gazdar (1996) for a description of the DATR
language, designed for defining inheritance networks for linguistic knowledge
representation, Fraser and Hudson (1992) for the use of inheritance in a
dependency grammar, and Daelemans et al. (1992) for a general overview.
Formalisms and systems for the implementation of constraint-based grammars
via typed feature structures include PAGE (?), ALE (Carpenter and Penn,
1994), and ConTroll (Götz et al., 1997).

Grammatical  theories  based  on  unification  include  Lexical  Functional 
Grammar  (LFG)  (Bresnan,  1982),  Head-Driven  Phrase  Structure  Grammar 
(HPSG)  (Pollard  and  Sag,  1987,  1994),  Construction  Grammar  (Kay  and 
Fillmore,  1999),  and  Unification  Categorial  Grammar  (Uszkoreit,  1986). 


Exercises 

11.1  Draw  the  DAGs  corresponding  to  the  AVMs  given  in  Examples  11.1 
and  11.2. 

11.2  Consider the following BERP examples, focusing on their use of
pronouns.

I want to spend lots of money.

Tell  me  about  Chez-Panisse. 

I’d  like  to  take  her  to  dinner. 

She  doesn’t  like  mexican. 

Assuming that these pronouns all belong to the category Pro, write lexical
and grammatical entries with unification constraints that block the following
examples.

*Me want to spend lots of money.

*Tell I about Chez-Panisse.

*I would like to take she to dinner.

*Her  doesn’t  like  mexican. 

11.3  Draw  a picture  of  the  subsumption  semilattice  corresponding  to  the 
feature  structures  in  Examples  11.3  to  11.8.  Be  sure  to  include  the  most 
general  feature  structure  []. 

11.4  Consider  the  following  examples. 

The sheep are baaaaing.

The sheep is baaaaing.

Create appropriate lexical entries for the words the, sheep, and baaaaing.
Show that your entries permit the correct assignment of a value to the
NUMBER feature for the subjects of these examples, as well as their various
parts.

11.5  Create  feature  structures  expressing  the  different  subcat  frames  for 
while  and  during  shown  on  page  412. 

11.6  Alter  the  pseudocode  shown  in  Figure  11.12  so  that  it  performs  the 
more  radical  kind  of  unification  parsing  described  on  page  431. 

11.7  Consider  the  following  problematic  grammar  suggested  by  Shieber 
(1985b). 

S → T
⟨T F⟩ = a

T1 → T2 A
⟨T1 F⟩ = ⟨T2 F F⟩

S → A
A → a

Show the first S state entered into the chart using your modified
PREDICTOR from the previous exercise, then describe any problematic behavior
displayed by PREDICTOR on subsequent iterations. Discuss the cause of the
problem and how it might be remedied.

11.8  Using the list approach to representing a verb's subcategorization
frame, show how a grammar could handle any number of verb
subcategorization frames with only the following two VP rules. More specifically,
show the constraints that would have to be added to these rules to make this
work.

VP → Verb
VP → VP X

The  solution  to  this  problem  involves  thinking  about  a recursive  walk  down 
a verb’s  subcategorization  frame.  This  is  a hard  problem;  you  might  consult 
Shieber  (1986)  if  you  get  stuck. 

11.9  Page 437 showed how to use typed feature structures to represent
constituency. Use that notation to represent rules 11.13, 11.14, and 11.15 shown
on page 407.


LEXICALIZED  AND 
PROBABILISTIC  PARSING 


Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth...

Robert Frost, The Road Not Taken


The characters in Damon Runyon's short stories are willing to bet "on any
proposition whatever", as Runyon says about Sky Masterson in The Idyll of
Miss Sarah Brown, from the probability of getting aces back-to-back to the
odds against a man being able to throw a peanut from second base to home
plate. There is a moral here for language processing: with enough
knowledge we can figure the probability of just about anything. The last three
chapters have introduced sophisticated models of syntactic structure and its
parsing. In this chapter we show that it is possible to build probabilistic
models of sophisticated syntactic information and use some of this probabilistic
information in efficient probabilistic parsers.

Of what use are probabilistic grammars and parsers? One key
contribution of probabilistic parsing is to disambiguation. Recall that sentences
can be very ambiguous; the Earley algorithm of Chapter 10 could
represent these ambiguities in an efficient way, but was not equipped to resolve
them. A probabilistic grammar offers a solution to the problem: choose
the most-probable interpretation. Thus, due to the prevalence of ambiguity,
probabilistic parsers can play an important role in most parsing or
natural-language understanding tasks.


Another important use of probabilistic grammars is in language
modeling for speech recognition or augmentative communication. We saw that
N-gram grammars were important in helping speech recognizers predict
upcoming words, helping constrain the search for words. Probabilistic
versions of more sophisticated grammars can provide additional predictive
power to a speech recognizer. Indeed, since humans have to deal with the
same problems of ambiguity as do speech recognizers, it is significant that
we are finding psychological evidence that people use something like these
probabilistic grammars in human language-processing tasks (reading,
human speech understanding).

This integration of sophisticated structural and probabilistic models of
syntax is at the very cutting edge of the field. Because of its newness, no
single model has become standard, in the way the context-free grammar has
become a standard for non-probabilistic syntax. We will explore the field
by presenting a number of probabilistic augmentations to context-free
grammars, showing how to parse some of them, and suggesting directions the
field may take. The chapter begins with probabilistic context-free
grammars (PCFGs), a probabilistic augmentation of context-free grammars,
together with the CYK algorithm, a standard dynamic programming
algorithm for parsing PCFGs. We then show two simple extensions to PCFGs
to handle probabilistic subcategorization information and probabilistic
lexical dependencies, give an evaluation metric for evaluating parsers, and then
introduce some advanced issues and some discussion of human parsing.


12.1  Probabilistic Context-Free Grammars

The simplest augmentation of the context-free grammar is the Probabilistic
Context-Free Grammar (PCFG), also known as the Stochastic
Context-Free Grammar (SCFG), first proposed by Booth (1969).

Recall that a context-free grammar $G$ is defined by four parameters
$(N, \Sigma, P, S)$:

1. a set of nonterminal symbols (or 'variables') $N$

2. a set of terminal symbols $\Sigma$ (disjoint from $N$)

3. a set of productions $P$, each of the form $A \rightarrow \beta$, where $A$ is a
nonterminal and $\beta$ is a string of symbols from the infinite set of strings
$(\Sigma \cup N)^*$

4. a designated start symbol $S$
S    → NP VP              [.80]     Det  → that [.05] | the [.80] | a [.15]
S    → Aux NP VP          [.15]     Noun → book                [.10]
S    → VP                 [.05]     Noun → flights             [.50]
NP   → Det Nom            [.20]     Noun → meal                [.40]
NP   → Proper-Noun        [.35]     Verb → book                [.30]
NP   → Nom                [.05]     Verb → include             [.30]
NP   → Pronoun            [.40]     Verb → want                [.40]
Nom  → Noun               [.75]     Aux  → can                 [.40]
Nom  → Noun Nom           [.20]     Aux  → does                [.30]
Nom  → Proper-Noun Nom    [.05]     Aux  → do                  [.30]
VP   → Verb               [.55]     Proper-Noun → TWA          [.40]
VP   → Verb NP            [.40]     Proper-Noun → Denver       [.40]
VP   → Verb NP NP         [.05]     Pronoun → you [.40] | I [.60]


Figure 12.1  A PCFG; a probabilistic augmentation of the miniature
English grammar and lexicon in Figure 10.2. These probabilities are not based
on a corpus; they were made up merely for expository purposes.


A probabilistic context-free grammar augments each rule in $P$ with a
conditional probability:

$A \rightarrow \beta \; [p]$  (12.1)

A PCFG is thus a 5-tuple $G = (N, \Sigma, P, S, D)$, where $D$ is a function
assigning probabilities to each rule in $P$. This function expresses the
probability $p$ that the given nonterminal $A$ will be expanded to the sequence $\beta$; it
is often referred to as

$P(A \rightarrow \beta)$

or as

$P(A \rightarrow \beta \mid A)$

Formally this is the conditional probability of a given expansion given the
left-hand-side nonterminal $A$. Thus if we consider all the possible expansions
of a nonterminal, the sum of their probabilities must be 1. Figure 12.1 shows
a sample PCFG for a miniature grammar with only three nouns and three
verbs. Note that the probabilities of all of the expansions of a nonterminal
sum to 1. Obviously in any real grammar there are a great many more rules
for each nonterminal and hence the probabilities of any particular rule are
much smaller.
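
The requirement that the expansions of each nonterminal sum to 1 is easy to check mechanically. The sketch below encodes only the phrasal rules of Figure 12.1 (the lexical rules are left out for brevity); the tuple layout and the function names `expansion_totals` and `is_proper` are choices made for this illustration.

```python
from collections import defaultdict
from math import isclose

# Phrasal rules of Figure 12.1 as (left-hand side, right-hand side, probability).
RULES = [
    ("S", ("NP", "VP"), .80),
    ("S", ("Aux", "NP", "VP"), .15),
    ("S", ("VP",), .05),
    ("NP", ("Det", "Nom"), .20),
    ("NP", ("Proper-Noun",), .35),
    ("NP", ("Nom",), .05),
    ("NP", ("Pronoun",), .40),
    ("Nom", ("Noun",), .75),
    ("Nom", ("Noun", "Nom"), .20),
    ("Nom", ("Proper-Noun", "Nom"), .05),
    ("VP", ("Verb",), .55),
    ("VP", ("Verb", "NP"), .40),
    ("VP", ("Verb", "NP", "NP"), .05),
]


def expansion_totals(rules):
    """Sum P(A -> beta | A) over all expansions beta of each nonterminal A."""
    totals = defaultdict(float)
    for lhs, _rhs, p in rules:
        totals[lhs] += p
    return dict(totals)


def is_proper(rules):
    """True if every nonterminal's expansion probabilities sum to 1
    (up to floating-point rounding)."""
    return all(isclose(total, 1.0) for total in expansion_totals(rules).values())


print(is_proper(RULES))
```

Dropping any single rule from the list makes the check fail for that rule's left-hand side, since the remaining expansions no longer account for all of the probability mass.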
How are these probabilities used? A PCFG can be used to estimate a
number of useful probabilities concerning a sentence and its parse-tree(s).
For example, a PCFG assigns a probability to each parse-tree $T$ (i.e., each
derivation) of a sentence $S$. This attribute is useful in disambiguation.
For example, consider the two parses of the sentence "Can you book TWA
flights" (one meaning 'Can you book flights on behalf of TWA', and the other
meaning 'Can you book flights run by TWA') shown in Figure 12.2.

The  probability  of  a particular  parse  T is  defined  as  the  product  of  the 
probabilities  of  all  the  rules  r used  to  expand  each  node  n in  the  parse  tree: 

$$P(T,S) = \prod_{n \in T} p(r(n)) \tag{12.2}$$

The  resulting  probability  P(T,S)  is  both  the  joint  probability  of  the 
parse  and  the  sentence,  and  also  the  probability  of  the  parse  P(T).  How  can 
this  be  true?  First,  by  the  definition  of  joint  probability: 

$$P(T,S) = P(T)\,P(S \mid T) \tag{12.3}$$

But since a parse tree includes all the words of the sentence, $P(S \mid T)$ is
1. Thus:

$$P(T,S) = P(T)\,P(S \mid T) = P(T) \tag{12.4}$$

The probability of each of the trees in Figure 12.2 can be computed by
multiplying together the probabilities of the rules used in the derivation.
For example, the probability of the left tree in Figure 12.2(a) (call it
$T_l$) and the right tree (Figure 12.2(b), or $T_r$) can be computed as follows:

$$P(T_l) = .15 \times .40 \times .05 \times .05 \times .35 \times .75 \times .40 \times .40 \times .40 \times .30 \times .40 \times .50 = 1.5 \times 10^{-7} \tag{12.5}$$

$$P(T_r) = .15 \times .40 \times .40 \times .05 \times .05 \times .75 \times .40 \times .40 \times .40 \times .30 \times .40 \times .50 = 1.7 \times 10^{-7} \tag{12.6}$$
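
The two products can be checked with a few lines of Python. This is only a sketch: the factor lists are copied from equations (12.5) and (12.6), while the function names are illustrative. It also shows the log-domain variant that practical parsers tend to use, since products of many small probabilities underflow quickly.

```python
import math

# Rule probabilities in the order listed in equations (12.5) and (12.6).
LEFT_TREE = [.15, .40, .05, .05, .35, .75, .40, .40, .40, .30, .40, .50]
RIGHT_TREE = [.15, .40, .40, .05, .05, .75, .40, .40, .40, .30, .40, .50]

def tree_probability(rule_probs):
    """P(T): the product of the probabilities of all rules used in the tree."""
    return math.prod(rule_probs)

def tree_log_probability(rule_probs):
    """log P(T): the sum of log rule probabilities (avoids underflow)."""
    return sum(math.log(p) for p in rule_probs)

p_left = tree_probability(LEFT_TREE)
p_right = tree_probability(RIGHT_TREE)
print(p_left, p_right)
print("right tree preferred:", p_right > p_left)
```

The two trees share all but two factors, so the comparison reduces to .40 × .05 for the right tree against .05 × .35 for the left tree.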

We  can  see  that  the  right  tree  in  Figure  12.2(b)  has  a higher  probability. 
Thus  this  parse  would  correctly  be  chosen  by  a disambiguation  algorithm 
which  selects  the  parse  with  the  highest  PCFG  probability. 

[Figure 12.2: parse trees (a) and (b) for “can you book TWA flights”; the
rules used in each tree are listed below.]

          (a)                                (b)
Rules                      P       Rules                      P
S → Aux NP VP             .15      S → Aux NP VP             .15
NP → Pro                  .40      NP → Pro                  .40
VP → V NP NP              .05      VP → V NP                 .40
NP → Nom                  .05      NP → Nom                  .05
NP → PNoun                .35      Nom → PNoun Nom           .05
Nom → Noun                .75      Nom → Noun                .75
Aux → Can                 .40      Aux → Can                 .40
NP → Pro                  .40      NP → Pro                  .40
Pro → you                 .40      Pro → you                 .40
Verb → book               .30      Verb → book               .30
PNoun → TWA               .40      PNoun → TWA               .40
Noun → flights            .50      Noun → flights            .50

Figure 12.2 Two parse trees for an ambiguous sentence. Parse (a) corresponds
to the meaning ‘Can you book flights on behalf of TWA?’, parse (b) to
‘Can you book flights which are run by TWA?’.

Let’s formalize this intuition that picking the parse with the highest
probability is the correct way to do disambiguation. The disambiguation
algorithm picks the best tree for a sentence S out of the set of parse trees
for S (which we’ll call $\tau(S)$). We want the parse tree T which is most likely
given the sentence S.

$$\hat{T}(S) = \operatorname*{argmax}_{T \in \tau(S)} P(T \mid S) \tag{12.7}$$

By definition the probability $P(T \mid S)$ can be rewritten as $P(T,S)/P(S)$, thus
leading to:

$$\hat{T}(S) = \operatorname*{argmax}_{T \in \tau(S)} \frac{P(T,S)}{P(S)} \tag{12.8}$$

Since  we  are  maximizing  over  all  parse  trees  for  the  same  sentence, 
P(S)  will  be  a constant  for  each  tree,  and  so  we  can  eliminate  it: 

$$\hat{T}(S) = \operatorname*{argmax}_{T \in \tau(S)} P(T,S) \tag{12.9}$$

Furthermore,  since  we  showed  above  that  P(T,S)  = P(T),  the  final 
equation  for  choosing  the  most  likely  parse  simplifies  to  choosing  the  parse 
with  the  highest  probability: 

$$\hat{T}(S) = \operatorname*{argmax}_{T \in \tau(S)} P(T) \tag{12.10}$$

A second attribute of a PCFG is that it assigns a probability to the string
of words constituting a sentence. This is important in language modeling in
speech recognition, spell-correction, or augmentative communication. The
probability of an unambiguous sentence is $P(T,S) = P(T)$, or just the
probability of the single parse tree for that sentence. The probability of an
ambiguous sentence is the sum of the probabilities of all the parse trees for
the sentence:

$$P(S) = \sum_{T \in \tau(S)} P(T,S) \tag{12.11}$$

$$\phantom{P(S)} = \sum_{T \in \tau(S)} P(T) \tag{12.12}$$
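
Both uses of parse probabilities, choosing the best parse and scoring the sentence, can be sketched in a few lines of Python. The sketch assumes the candidate parses of one sentence have already been enumerated and scored; the dictionary `parses` and its values are hypothetical.

```python
def best_parse(parse_probs):
    """Equation (12.10): the parse T in tau(S) with the highest P(T)."""
    return max(parse_probs, key=parse_probs.get)

def sentence_probability(parse_probs):
    """Equations (12.11)-(12.12): P(S) is the sum of P(T) over all parses."""
    return sum(parse_probs.values())

def posterior(parse_probs):
    """P(T | S) = P(T, S) / P(S) for every candidate parse."""
    total = sentence_probability(parse_probs)
    return {tree: p / total for tree, p in parse_probs.items()}

# Hypothetical probabilities for the two parses of one ambiguous sentence.
parses = {"T_a": 2.0e-7, "T_b": 6.0e-7}
print(best_parse(parses))
print(sentence_probability(parses))
print(posterior(parses))
```

Dividing by P(S) rescales every candidate by the same constant, which is why the argmax in (12.8) and (12.9) picks the same tree.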

An additional useful feature of PCFGs for language modeling is that
they can assign a probability to substrings of a sentence. For example,
Jelinek and Lafferty (1991) give an algorithm for efficiently computing the
probability of a prefix of a sentence. This is the probability that the grammar
generates a sentence whose initial substring is $w_1 w_2 \ldots w_i$. Stolcke (1995)
shows how the standard Earley parser can be augmented to compute these
prefix probabilities, and Jurafsky et al. (1995) describe an application of a
version of this algorithm as the language model for a speech recognizer.

A PCFG is said to be consistent if the sum of the probabilities of all
sentences in the language equals 1. Certain kinds of recursive rules cause
a grammar to be inconsistent by causing infinitely looping derivations for
some sentences. For example, a rule S → S with probability 1 would lead to
lost probability mass due to derivations that never terminate. See Booth and
Thompson (1973) for more details on consistent and inconsistent grammars.
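
To make the idea of lost probability mass concrete, consider a small grammar that is an assumption of this sketch rather than an example from the text: S → S S [p] and S → a [1 − p]. The probability x that a derivation from S terminates satisfies x = (1 − p) + p·x², and the terminating mass is the smallest non-negative solution, which can be approximated by iterating from 0. The function name `terminating_mass` is illustrative.

```python
def terminating_mass(p, iterations=500):
    """Total probability of all finite derivations from S for the grammar
    S -> S S [p], S -> a [1 - p], found by fixed-point iteration of
    x = (1 - p) + p * x**2 starting from x = 0."""
    x = 0.0
    for _ in range(iterations):
        x = (1.0 - p) + p * x * x
    return x

print(terminating_mass(0.3))      # recursion is rare: mass stays close to 1
print(terminating_mass(2 / 3))    # recursion dominates: mass is lost
```

For p = 2/3 the iteration converges to 0.5, so half of the probability mass belongs to derivations that never terminate and the grammar is inconsistent even though the rule probabilities for S sum to 1.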
Probabilistic CYK Parsing of PCFGs

The  parsing  problem  for  PCFGs  is  to  produce  the  most-likely  parse  for  a 
given  sentence,  i.e.  to  compute 

$$\hat{T}(S) = \operatorname*{argmax}_{T \in \tau(S)} P(T) \tag{12.13}$$

Luckily, the algorithms for computing the most-likely parse are simple
extensions of the standard algorithms for parsing. Chapter 10 introduced
the use of the Earley algorithm to find all parses for a given input sentence
and a given context-free grammar. It is possible to augment the Earley
algorithm to compute the probability of each of its parses, and thus to find the
most likely parse. Instead of presenting the probabilistic Earley algorithm
here, however, we will present the probabilistic CYK (Cocke-Younger-Kasami)
algorithm. We do this because the probabilistic Earley algorithm is
somewhat complex to present, and also because the CYK algorithm is worth
understanding, and we haven’t yet studied it. The reader is thus referred to
Stolcke (1995) for the presentation of the probabilistic Earley algorithm.

Where  the  Earley  algorithm  is  essentially  a top-down  parser  which  uses 
a dynamic  programming  table  to  efficiently  store  its  intermediate  results,  the 
CYK  algorithm  is  essentially  a bottom-up  parser  using  the  same  dynamic 
programming  table.  The  fact  that  CYK  is  bottom-up  makes  it  more  efficient 
when  processing  lexicalized  grammars,  as  we  will  see  later. 

Probabilistic CYK parsing was first described by Ney (1991), but the
version of the probabilistic CYK algorithm that we present is adapted from
Collins (1999) and Aho and Ullman (1972). Assume first that the PCFG is
in Chomsky normal form; recall from page 344 that a grammar is in CNF if
it is ε-free and if in addition each production is either of the form A → B C
or A → a. The CYK algorithm assumes the following input, output, and data
structures:

• Input.

- A Chomsky normal form PCFG $G = (N, \Sigma, P, S, D)$. Assume that
the $|N|$ nonterminals have indices $1, 2, \ldots, |N|$, and that the start
symbol S has index 1.

- n words $w_1 \ldots w_n$.

• Data Structure. A dynamic programming array $\pi[i, j, a]$ holds the
maximum probability for a constituent with nonterminal index $a$ spanning
the $j$ words that begin at word $i$. Back-pointers in the array are used to
store the links between constituents in a parse tree.




• Output. The maximum probability parse will be $\pi[1, n, 1]$: the parse
tree whose root is S and which spans the entire string of words $w_1 \ldots w_n$.

Like the other dynamic programming algorithms (minimum edit distance,
Forward, Viterbi, and Earley), the CYK algorithm fills out the probability
array by induction. In this description, we will use $w_{ij}$ to mean the
string of $j$ words beginning at word $i$, following Aho and Ullman (1972):

• base case: Consider the input strings of length one (i.e., individual
words $w_i$). In Chomsky normal form, the probability of a given nonterminal
A expanding to a single word $w_i$ must come only from the rule
A → $w_i$ (since A ⇒* $w_i$ if and only if A → $w_i$ is a production).

• recursive case: For strings of words of length > 1, A ⇒* $w_{ij}$ if and only
if there is at least one rule A → B C and some k, 1 ≤ k < j, such that B
derives the first k symbols of $w_{ij}$ and C derives the last j - k symbols
of $w_{ij}$. Since each of these strings of words is shorter than the original
string $w_{ij}$, their probabilities will already be stored in the matrix π. We
compute the probability of $w_{ij}$ by multiplying together the probabilities
of these two pieces and of the rule A → B C. But there may be multiple
parses of $w_{ij}$, and so we’ll need to take the max over all the possible
divisions of $w_{ij}$ (i.e., over all values of k and over all possible rules).

Figure  12.3  gives  pseudocode  for  this  probabilistic  CYK  algorithm, 
again  adapted  from  Collins  (1999)  and  Aho  and  Ullman  (1972). 
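
The induction above can also be written as a short runnable program. The following Python sketch is one possible rendering under stated assumptions, not the authors' implementation: it indexes the table by half-open word spans (start, end) rather than by (start, length), stores the grammar in two dictionaries, and uses symbol names instead of integer indices. The function name `pcyk`, the argument names, and the toy grammar are all hypothetical.

```python
from collections import defaultdict

def pcyk(words, lexical, binary, start="S"):
    """Probabilistic CYK for a PCFG in Chomsky normal form.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns (probability, tree) for the best parse rooted in `start`,
    or (0.0, None) if the words cannot be parsed.
    """
    n = len(words)
    pi = defaultdict(float)   # (i, j, A) -> best probability of A over words[i:j]
    back = {}                 # (i, j, A) -> (k, B, C): split point and children

    # Base case: spans of length one come only from lexical rules.
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                pi[i, i + 1, A] = p

    # Recursive case: build longer spans from two shorter adjacent spans.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    prob = pi[i, k, B] * pi[k, j, C] * p
                    if prob > pi[i, j, A]:
                        pi[i, j, A] = prob
                        back[i, j, A] = (k, B, C)

    def build(i, j, A):
        """Follow back-pointers to recover the best tree for A over words[i:j]."""
        if j - i == 1:
            return (A, words[i])
        k, B, C = back[i, j, A]
        return (A, build(i, k, B), build(k, j, C))

    best = pi[0, n, start]
    return (best, build(0, n, start)) if best > 0 else (0.0, None)

# Hypothetical toy grammar in Chomsky normal form.
LEXICAL = {("NP", "she"): 0.4, ("NP", "fish"): 0.6, ("V", "eats"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
print(pcyk(["she", "eats", "fish"], LEXICAL, BINARY))
```

Looping over the rules of the grammar, rather than over all triples of nonterminals, does the same work as the three innermost loops of the pseudocode but skips triples for which no rule exists.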


Learning  PCFG  probabilities 




Where do PCFG probabilities come from? There are two ways to assign
probabilities to a grammar. The simplest way is to use a corpus of
already-parsed sentences. Such a corpus is called a treebank. For example, the Penn
Treebank (Marcus et al., 1993), distributed by the Linguistic Data Consortium,
contains parse trees for the Brown Corpus, one million words from
the Wall Street Journal, and parts of the Switchboard corpus. Given a treebank,
the probability of each expansion of a nonterminal can be computed by
counting the number of times that expansion occurs and then normalizing.
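
$$P(\alpha \rightarrow \beta \mid \alpha) = \frac{\mathrm{Count}(\alpha \rightarrow \beta)}{\sum_{\gamma} \mathrm{Count}(\alpha \rightarrow \gamma)} = \frac{\mathrm{Count}(\alpha \rightarrow \beta)}{\mathrm{Count}(\alpha)} \tag{12.14}$$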
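
Equation (12.14) amounts to counting and normalizing. The sketch below illustrates it in Python, assuming trees are stored as nested tuples of the form (label, child, ...), where a child is either a subtree or a word string; this representation, the function names, and the two-tree `TOY_TREEBANK` are hypothetical and much simpler than real treebank formats.

```python
from collections import Counter

def count_rules(tree, counts):
    """Add one count for every rule used in `tree`."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label, rhs] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)

def estimate_pcfg(treebank):
    """Equation (12.14): P(a -> b | a) = Count(a -> b) / Count(a)."""
    rule_counts = Counter()
    for tree in treebank:
        count_rules(tree, rule_counts)
    lhs_counts = Counter()
    for (lhs, _rhs), c in rule_counts.items():
        lhs_counts[lhs] += c
    return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

# A hypothetical two-tree "treebank".
TOY_TREEBANK = [
    ("S", ("NP", ("Pro", "you")),
          ("VP", ("Verb", "book"), ("NP", ("Noun", "flights")))),
    ("S", ("NP", ("Pro", "you")), ("VP", ("Verb", "sleep"))),
]
probs = estimate_pcfg(TOY_TREEBANK)
for rule, p in sorted(probs.items()):
    print(rule, round(p, 3))
```

In this toy treebank NP expands as Pro twice and as Noun once, so the estimate for NP → Pro is 2/3.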


When  a treebank  is  unavailable,  the  counts  needed  for  computing  PCFG 
probabilities  can  be  generated  by  first  parsing  a corpus.  If  sentences  were 
unambiguous,  it  would  be  as  simple  as  this:  parse  the  corpus,  increment  a 
counter  for  every  rule  in  the  parse,  and  then  normalize  to  get  probabilities. 
function CYK(words, grammar) returns best_parse
  Create and clear π[num_words, num_words, num_nonterminals]

  # base case: the second index of π is the span length
  for i = 1 to num_words
    for A = 1 to num_nonterminals
      if A → w_i is in grammar then
        π[i, 1, A] = P(A → w_i)

  # recursive case: j is the span length, k the length of the left piece
  for j = 2 to num_words
    for i = 1 to num_words - j + 1
      for k = 1 to j - 1
        for A = 1 to num_nonterminals
          for B = 1 to num_nonterminals
            for C = 1 to num_nonterminals
              prob = π[i, k, B] × π[i+k, j-k, C] × P(A → B C)
              if (prob > π[i, j, A]) then
                π[i, j, A] = prob
                back[i, j, A] = {k, B, C}


Figure 12.3 The Probabilistic CYK algorithm for finding the maximum
probability parse of a string of num_words words given a PCFG grammar with
num_rules rules in Chomsky Normal Form. back is the array of back-pointers
used to recover the best parse. After Collins (1999) and Aho and Ullman
(1972).


However, since most sentences are ambiguous, in practice we need to keep
a separate count for each parse of a sentence and weight each partial count
by the probability of the parse it appears in. The standard algorithm for
computing this is called the Inside-Outside algorithm, and was proposed
by Baker (1979) as a generalization of the forward-backward algorithm of
Chapter 7. See Manning and Schütze (1999) for a complete description of
the algorithm.


12.2  Problems  with  PCFGs 


While probabilistic context-free grammars are a natural extension to
context-free grammars, they have a number of problems as probability estimators.
Because of these problems, most current probabilistic parsing models use
some augmentation of PCFGs rather than using vanilla PCFGs. This section
will summarize problems with PCFGs in modeling structural dependencies
and in modeling lexical dependencies.

One problem with PCFGs comes from their fundamental independence
assumption. By definition, a CFG assumes that the expansion of any one
nonterminal is independent of the expansion of any other nonterminal. This
independence assumption is carried over in the probabilistic version; each
PCFG rule is assumed to be independent of each other rule, and thus the rule
probabilities are multiplied together. But an examination of the statistics of
English syntax shows that sometimes the choice of how a node expands is
dependent on the location of the node in the parse tree. For example, consider
the differential placement in a sentence of pronouns versus full lexical noun
phrases. Beginning with Kuno (1972), many linguists have shown that there
is a strong tendency in English (as well as in many other languages) for the
syntactic subject of a sentence to be a pronoun. This tendency is caused by
the use of subject position to realize the ‘topic’ or old information in a
sentence (Givón, 1990). Pronouns are a way to talk about old information, while
non-pronominal (‘lexical’) noun phrases are often used to introduce new
referents. For example, Francis et al. (1999) show that of the 31,021 subjects of
declarative sentences in Switchboard, 91% are pronouns (12.15a), and only
9% are lexical (12.15b). By contrast, out of the 7,489 direct objects, only
34% are pronouns (12.16a), and 66% are lexical (12.16b).

(12.15)  (a)  She’s  able  to  take  her  baby  to  work  with  her. 

(b)  Uh,  my  wife  worked  until  we  had  a family. 

(12.16)  (a)  Some  laws  absolutely  prohibit  it. 

(b)  All  the  people  signed  confessions. 

These dependencies could be captured if the probability of expanding
an NP as a pronoun (for example via the rule NP → Pronoun) versus a lexical
NP (for example via the rule NP → Det Noun) were dependent on whether
the NP was a subject or an object. But this is just the kind of probabilistic
dependency that a PCFG does not allow.
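
A little arithmetic shows what is lost. The Python sketch below pools the Switchboard figures quoted above into the single, position-independent pronoun probability that a PCFG would have to use. It is an illustration only: it considers subjects and direct objects and ignores NPs in every other position, and the function name is hypothetical.

```python
# Figures from Francis et al. (1999) as quoted in the text.
SUBJECTS, SUBJ_PRONOUN_RATE = 31021, 0.91
OBJECTS, OBJ_PRONOUN_RATE = 7489, 0.34

def pooled_pronoun_rate():
    """One estimate of P(NP -> Pronoun) shared by subjects and objects."""
    pronouns = SUBJECTS * SUBJ_PRONOUN_RATE + OBJECTS * OBJ_PRONOUN_RATE
    return pronouns / (SUBJECTS + OBJECTS)

pooled = pooled_pronoun_rate()
print(round(pooled, 3))
print("error for subjects:", round(SUBJ_PRONOUN_RATE - pooled, 3))
print("error for objects:", round(pooled - OBJ_PRONOUN_RATE, 3))
```

The pooled value comes out at roughly 0.80, which underestimates pronouns in subject position and badly overestimates them in object position.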

An even more important problem with PCFGs is their lack of sensitivity
to words. Lexical information in a PCFG can only be represented via the
probability of pre-terminal nodes (Verb, Noun, Det) to be expanded lexically.
But there are a number of other kinds of lexical and other dependencies that
turn out to be important in modeling syntactic probabilities. For example,
a number of researchers have shown that lexical information plays an
important role in selecting the correct parsing of an ambiguous
prepositional-phrase attachment (Ford et al., 1982; Whittemore et al., 1990; Hindle and
Rooth, 1991, inter alia). Consider the following example from Hindle and
Rooth (1991):


(12.17)  Moscow  sent  more  than  100,000  soldiers  into  Afghanistan. . . 


Here the prepositional phrase into Afghanistan can be attached either to
the NP more than 100,000 soldiers or to the verb phrase headed by sent.
In a PCFG, the attachment choice comes down to the choice between two
rules: NP → NP PP (NP-attachment) and VP → NP PP (VP-attachment).
The probability of these two rules depends on the training corpus; Hindle and
Rooth (1991) report that NP-attachment happens about 67% of the time,
compared to 33% for VP-attachment, in 13 million words from the AP newswire;
Collins (1999) reports 52% NP-attachment in a corpus containing a mixture of
Wall Street Journal and I.B.M. computer manuals. Whether the preference is 52%
or 67%, crucially in a PCFG this preference is purely structural and must be
the same for all verbs.

In (12.17), however, the correct attachment is to the verb; in this case
because the verb send subcategorizes for a destination, which can be
expressed with the preposition into. Indeed all of the cases of ambiguous
into-PP-attachments with the main verb send in the Penn Treebank’s Brown and
Wall Street Journal corpora attached to the verb. Thus a model which kept
separate lexical dependency statistics for different verbs would be able to
choose the correct parse in these cases.

Coordination ambiguities are another case where lexical dependencies
are the key to choosing the proper parse. Figure 12.4 shows an example
from Collins (1999), with two parses for the phrase dogs in houses and cats.
Because dogs is semantically a better conjunct for cats than houses (and
because dogs can’t fit inside cats) the parse [dogs in [NP houses and cats]]
is intuitively unnatural and should be dispreferred. The two parses in
Figure 12.4, however, have exactly the same PCFG rules and thus a PCFG will
assign them the same probability.

In summary, probabilistic context-free grammars have a number of
inadequacies as a probabilistic model of syntax. In the next section we sketch
current methods for augmenting PCFGs to deal with these issues.

[Figure 12.4: two parse trees for “dogs in houses and cats”. Left:
[[dogs in houses] and cats]. Right: [dogs in [houses and cats]].]

Figure 12.4 An instance of coordination ambiguity. Although the left
structure is intuitively the correct one, a PCFG will assign them identical
probabilities since both structures use the exact same rules. After Collins
(1999).


12.3  Probabilistic  Lexicalized  CFGs 

We saw in Chapter 11 that syntactic constituents could be associated with a
lexical head. This idea of a head for each constituent dates back to
Bloomfield (1914), but was first used to extend PCFG modeling by Black et al.
(1992). The probabilistic representation of lexical heads used in recent parsers
such as Charniak (1997) and Collins (1999) is simpler than the complex
head-feature models we saw in Chapter 11. In the simpler probabilistic
representation, each nonterminal in a parse tree is annotated with a single word
which is its lexical head. Figure 12.5 shows an example of such a tree from
Collins (1999), in which each nonterminal is annotated with its head.
“Workers dumped sacks into a bin” is a shortened form of a WSJ sentence.

S(dumped)
  NP(workers)
    NNS(workers)  workers
  VP(dumped)
    VBD(dumped)  dumped
    NP(sacks)
      NNS(sacks)  sacks
    PP(into)
      P(into)  into
      NP(bin)
        DT(a)  a
        NN(bin)  bin

Figure 12.5 A lexicalized tree from Collins (1999).


In order to generate such a tree, each PCFG rule must be augmented to
identify one right-hand-side constituent to be the head daughter. The
headword for a node is then set to the headword of its head daughter. Choosing
these head daughters is simple for textbook examples (NN is the head of
NP), but is complicated and indeed controversial for most phrases (should
the complementizer to or the verb be the head of an infinitive verb phrase?).
Modern linguistic theories of syntax generally include a component that
defines heads (see for example Pollard and Sag, 1994). Collins (1999) also
gives a description of a practical set of head rules for Penn Treebank
grammars modified from Magerman; for example, their rule for finding the head
of an NP is to return the very last word in the NP if it is tagged POS
(possessive); else to search from right to left in the NP for the first child
which is an NN, NNP, NNPS, NNS, NX, POS, or JJR; else to search from left
to right for the first child which is an NP.
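
The NP head rule just described translates almost directly into code. The following Python sketch implements only the three steps quoted above and returns None when none of them applies; a full set of head rules would need further fallbacks. The function name `np_head_index` and the representation of an NP as a list of child category labels are assumptions of the sketch.

```python
NP_HEAD_TAGS = {"NN", "NNP", "NNPS", "NNS", "NX", "POS", "JJR"}

def np_head_index(child_labels):
    """Index of the head child of an NP, given its children's categories
    from left to right; None if none of the three steps applies."""
    if not child_labels:
        return None
    # Step 1: the very last child, if it is tagged POS (possessive).
    if child_labels[-1] == "POS":
        return len(child_labels) - 1
    # Step 2: search right to left for the first nominal-like child.
    for i in range(len(child_labels) - 1, -1, -1):
        if child_labels[i] in NP_HEAD_TAGS:
            return i
    # Step 3: search left to right for the first NP child.
    for i, label in enumerate(child_labels):
        if label == "NP":
            return i
    return None

print(np_head_index(["DT", "NN"]))    # the noun of "a bin"
print(np_head_index(["NP", "PP"]))    # the inner NP of "sacks into a bin"
```

Applied bottom-up, a head-finder of this kind is what annotates each node of a tree such as Figure 12.5 with its headword.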

One way to think of these head features is as a simplified version of
the head features in a unification grammar; instead of complicated re-entrant
feature values, we just allow an attribute to have a single value from a finite
set (in fact the set of words in the vocabulary). Technically, grammars in
which each node is annotated by non-recursive features are called attribute
grammars.

Another  way  to  think  of  a lexicalized  grammar  is  as  a simple  context- 
free  grammar  with  a lot  more  rules;  it’s  as  if  we  created  many  copies  of  each 
rule,  one  copy  for  each  possible  headword  for  each  constituent;  this  idea  of 
building  a lexicalized  grammar  is  due  to  Schabes  et  al.  (1988)  and  Schabes 
(1990).  In  general  there  may  be  too  many  such  rules  to  actually  keep  them 
around,  but  thinking  about  lexicalized  grammars  this  way  makes  it  clearer 
that  we  can  parse  them  with  standard  CFG  parsing  algorithms. 

Let’s now see how these lexicalized grammars can be augmented with
probabilities, and how by doing so we can represent the kind of lexical
dependencies we discussed above and in Chapter 9. Suppose we were to treat a
probabilistic lexicalized CFG like a normal but huge PCFG. Then we would
store a probability for each rule/head combination, as in the following
contrived examples:

VP(dumped) → VBD(dumped) NP(sacks) PP(into)    [$3 \times 10^{-10}$]
VP(dumped) → VBD(dumped) NP(cats) PP(into)     [$8 \times 10^{-11}$]
VP(dumped) → VBD(dumped) NP(hats) PP(into)     [$4 \times 10^{-10}$]
VP(dumped) → VBD(dumped) NP(sacks) PP(above)   [$1 \times 10^{-12}$]

(12.18)

The  problem  with  this  method,  of  course,  is  that  there  is  no  corpus 
big  enough  to  train  such  probabilities.  Training  standard  PCFG  probabilities 
would  result  in  zero  counts  for  almost  all  the  rules.  To  avoid  this,  we  need  to 
make  some  simplifying  independence  assumptions  in  order  to  cluster  some 
of  the  counts. 

Perhaps  the  main  difference  between  various  modern  statistical  parsers 
lies  in  exactly  which  independence  assumptions  they  make.  In  the  rest  of  this 
section  we  describe  a simplified  version  of  Charniak’s  (1997)  parser,  but  we 
could  also  have  chosen  any  of  the  other  similar  dependency-based  statistical 
parsers  (such  as  Magerman  (1995),  Collins  (1999),  and  Ratnaparkhi  (1997)). 

Like many of these others, Charniak’s parser incorporates lexical
dependency information by relating the heads of phrases to the heads of their
constituents. His parser also incorporates syntactic subcategorization
information by conditioning the probability of a given rule expansion of a
nonterminal on the head of the nonterminal. Let’s look at examples of slightly
simplified versions of the two kinds of statistics (simplified by being
conditioned on fewer factors than in Charniak’s complete algorithm).

First, recall that in a vanilla PCFG, the probability of a node n being
expanded via rule r is conditioned on exactly one factor: the syntactic
category of the node n. (For simplicity we will use the notation n to mean the
syntactic category of n.) We will simply add one more conditioning factor:
the headword of the node h(n). Thus we will be computing the probability

$$p(r(n) \mid n, h(n)) \tag{12.19}$$

Consider for example the probability of expanding the VP in Figure 12.5
via the rule r, which is:

VP → VBD NP PP

This probability is p(r | VP, dumped), answering the question “What is
the probability that a VP headed by dumped will be expanded as VBD NP
PP?”. This lets us capture subcategorization information about dumped; for
example, a VP whose head is dumped may be more likely to have an NP and
a PP than a VP whose head is slept.

Now that we have added heads as a conditioning factor, we need to
decide how to compute the probability of a head. The null assumption would
make all heads equally likely; the probability that the head of a node would
be sacks would be the same as the probability that the head would be racks.
This doesn’t seem very useful. The syntactic category of the node ought
to matter (nouns might have different kinds of heads than verbs). And the
neighboring heads might matter too. Let’s condition the probability of a
node n having a head h on two factors: the syntactic category of the node n,
and the head of the node’s mother h(m(n)). This is the probability

$$p(h(n) = \mathit{word}_j \mid n, h(m(n))) \tag{12.20}$$

Consider for example the probability that the NP that is the second
daughter of the VP in Figure 12.5 has the head sacks. The probability of this
head is p(head(n) = sacks | n = NP, h(m(n)) = dumped). This probability
answers the question “What is the probability that an NP whose mother’s
head is dumped has the head sacks?”, sketched in the following drawing:

X(dumped)

NP(?sacks?)

The figure shows that what this head-probability is really doing is
capturing dependency information, e.g., between the words dumped and sacks.

How are these two probabilities used to compute the probability of
a complete parse? Instead of just computing the probability of a parse by
multiplying each of the PCFG rule probabilities, we will modify equation
(12.2) by additionally conditioning each rule on its head:

$$P(T,S) = \prod_{n \in T} p(r(n) \mid n, h(n)) \times p(h(n) \mid n, h(m(n))) \tag{12.21}$$
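
Equation (12.21) can be read as a recursive walk over the tree. The Python sketch below computes its logarithm under several assumptions that are not part of the text: a node is a triple (category, headword, children), a preterminal has an empty child list and rewrites directly as its headword, the two probability models are supplied as functions, and the root's mother head is passed as None. All names are hypothetical.

```python
import math

def lexicalized_log_prob(node, rule_prob, head_prob, mother_head=None):
    """log of equation (12.21) for the subtree rooted at `node`.

    rule_prob(rule, category, head)        stands for p(r(n) | n, h(n))
    head_prob(head, category, mother_head) stands for p(h(n) | n, h(m(n)))
    Both functions must return non-zero (e.g. smoothed) probabilities.
    """
    category, head, children = node
    rhs = tuple(child[0] for child in children) if children else (head,)
    logp = math.log(rule_prob((category, rhs), category, head))
    logp += math.log(head_prob(head, category, mother_head))
    for child in children:
        # Each daughter's head is conditioned on this node's head.
        logp += lexicalized_log_prob(child, rule_prob, head_prob, head)
    return logp

# A tiny hypothetical tree and placeholder probability functions.
TOY_TREE = ("VP", "dumped", [("VBD", "dumped", []), ("NP", "sacks", [])])

def uniform(*_args):
    """Placeholder model that returns 0.5 for everything (illustration only)."""
    return 0.5

print(lexicalized_log_prob(TOY_TREE, uniform, uniform))
```

Working with log probabilities keeps the product from underflowing, but it also means that any zero estimate has to be smoothed before this function is called.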

Let’s look at a sample parse-ambiguity to see if these lexicalized
probabilities will be useful in disambiguation. Figure 12.6 shows an alternative
(incorrect) parse for the sentence “Workers dumped sacks into a bin”, again
from Collins (1999). In this incorrect parse the PP into a bin modifies the
NP sacks instead of the VP headed by dumped. This parse is incorrect
because into a bin is extremely unlikely to be a modifier of this NP; it is much
more likely to modify dumped, as in the original parse in Figure 12.5.

S(dumped)
  NP(workers)
    NNS(workers)  workers
  VP(dumped)
    VBD(dumped)  dumped
    NP(sacks)
      NP(sacks)
        NNS(sacks)  sacks
      PP(into)
        P(into)  into
        NP(bin)
          DT(a)  a
          NN(bin)  bin

Figure 12.6 An incorrect parse of the sentence in Figure 12.5, from Collins
(1999).

The head-head and head-rule probabilities in equation (12.21) will
indeed help us correctly choose the VP attachment (Figure 12.5) over the
NP attachment (Figure 12.6). One difference between the two trees is that
VP(dumped) expands to VBD NP PP in the correct tree and VBD NP in the
incorrect tree. Let’s compute both of these by counting in the Brown corpus
portion of the Penn Treebank. The first rule is quite likely:

$$p(VP \rightarrow VBD\ NP\ PP \mid VP, dumped) = \frac{C(VP(dumped) \rightarrow VBD\ NP\ PP)}{\sum_{\beta} C(VP(dumped) \rightarrow \beta)} = \frac{6}{9} = .67 \tag{12.22}$$

The second rule never happens in the Brown corpus. In practice this
zero value would be smoothed somehow, but for now let’s just notice that the
first rule is preferred. This isn’t surprising, since dump is a verb of
caused-motion into a new location:

$$p(VP \rightarrow VBD\ NP \mid VP, dumped) = \frac{C(VP(dumped) \rightarrow VBD\ NP)}{\sum_{\beta} C(VP(dumped) \rightarrow \beta)} = \frac{0}{9} = 0 \tag{12.23}$$


What about the head probabilities? In the correct parse, a PP node
whose mother’s head is dumped has the head into. In the incorrect parse, a PP
node whose mother’s head is sacks has the head into. Once again, let’s use
counts from the Brown portion of the Treebank:

$$p(into \mid PP, dumped) = \frac{C(X(dumped) \rightarrow \ldots PP(into) \ldots)}{\sum_{\beta} C(X(dumped) \rightarrow \ldots PP \ldots)} = \frac{2}{9} = .22 \tag{12.24}$$

$$p(into \mid PP, sacks) = \frac{C(X(sacks) \rightarrow \ldots PP(into) \ldots)}{\sum_{\beta} C(X(sacks) \rightarrow \ldots PP \ldots)} = \frac{0}{0} = ? \tag{12.25}$$

Once  again,  the  head  probabilities  correctly  predict  that  dumped  is 
more  likely  to  be  modified  by  into  than  is  sacks. 
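
Relative-frequency estimates of this kind are straightforward to compute once (category, head, expansion) triples have been read off a lexicalized treebank. The Python sketch below is a minimal illustration: the observations in `OBS` are hypothetical, not Brown corpus counts, and the estimator returns None when the conditioning context was never seen, which is exactly the situation where smoothing or backoff becomes necessary.

```python
from collections import Counter

def rule_given_head(observations):
    """Build an estimator of p(rule | category, head) by relative frequency.

    observations: iterable of (category, head, rhs) triples.
    Returns a function p(category, head, rhs)."""
    observations = list(observations)
    joint = Counter(observations)
    context = Counter((cat, head) for cat, head, _rhs in observations)

    def p(cat, head, rhs):
        denom = context[cat, head]
        if denom == 0:
            return None   # unseen context: undefined without smoothing
        return joint[cat, head, rhs] / denom

    return p

# Hypothetical observations, for illustration only.
OBS = [("VP", "dumped", ("VBD", "NP", "PP"))] * 2 + [("VP", "dumped", ("VBD", "PP"))]
p = rule_given_head(OBS)
print(p("VP", "dumped", ("VBD", "NP", "PP")))   # seen expansion
print(p("VP", "dumped", ("VBD", "NP")))         # unseen expansion of a seen head
print(p("VP", "slept", ("VBD",)))               # unseen head
```

The same counting scheme applies to the head probabilities of (12.20), with (category, mother's head) as the context and the node's own head as the outcome.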

Of course, one example does not prove that one method is better than
another. Furthermore, as we mentioned above, the probabilistic lexical
grammar presented above is a simplified version of Charniak’s actual algorithm.
He adds additional conditioning factors (such as conditioning the
rule-expansion probability on the syntactic category of the node’s grandparent),
and also proposes various backoff and smoothing algorithms, since any given
corpus may still be too small to acquire these statistics. Other statistical
parsers include even more factors, such as the distinction between arguments
and adjuncts and giving more weight to lexical dependencies which are closer
in the tree than those which are further (Collins, 1999), the three left-most
parts of speech in a given constituent (Magerman and Marcus, 1991), and
general structural preferences (such as the preference for right-branching
structures in English) (Briscoe and Carroll, 1993).

Many of these statistical parsers have been evaluated (on the same
corpus) using the methodology of the Methodology Box on page 460.

Extending  the  CYK  algorithm  to  handle  lexicalized  probabilities  is  left 
as  an  exercise  for  the  reader. 




12.4  Dependency  Grammars 

The previous section showed that constituent-based grammars could be
augmented with probabilistic relations between head words, and showed that
this lexical dependency information is important in modeling the lexical
constraints that heads (such as verbs) place on their arguments or modifiers.

An important class of grammar formalisms is based purely on this
lexical dependency information itself. In these dependency grammars,
constituents and phrase-structure rules do not play any fundamental role.
Instead, the syntactic structure of a sentence is described purely in terms of
words and binary semantic or syntactic relations between these words (called
lexical dependencies). Dependency grammars often draw heavily from the
work of Tesnière (1959), and the name dependency was presumably first


Methodology  Box:  Evaluating  Parsers 


The standard techniques for evaluating parsers and grammars are
called the PARSEVAL measures, and were proposed by Black et al.
(1991) based on the same ideas from signal-detection theory that we
saw in earlier chapters. In the simplest case, a particular parsing of
the test set (for example the Penn Treebank) is defined as the correct
parse. Given this ‘gold standard’ for a test set, a given constituent in
a candidate parse c of a sentence s is labeled ‘correct’ if there is a
constituent in the treebank parse with the same starting point, ending
point, and nonterminal symbol. We can then measure the precision,
recall, and a new metric (crossing brackets) for each sentence s:


labeled recall:

\[ \text{labeled recall} = \frac{\text{\# of correct constituents in candidate parse of } s}{\text{\# of correct constituents in treebank parse of } s} \]

labeled precision:

\[ \text{labeled precision} = \frac{\text{\# of correct constituents in candidate parse of } s}{\text{\# of total constituents in candidate parse of } s} \]

cross-brackets:  the  number  of  crossed  brackets  (e.g.  the  number 
of  constituents  for  which  the  treebank  has  a bracketing  such  as 
((A  B)  C)  but  the  candidate  parse  has  a bracketing  such  as  (A 
(B  C))). 
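
The three measures above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: each parse is represented as
a list of (label, start, end) constituents over word positions, with end
exclusive, and the function name parseval is our own. The canonicalization
step of Black et al. (1991) is not attempted here.

```python
from collections import Counter

def parseval(candidate, gold):
    """Return (labeled recall, labeled precision, crossing brackets).

    candidate and gold are lists of (label, start, end) constituents;
    this representation is an assumption made for the sketch.
    """
    # A candidate constituent is correct if the gold parse contains one
    # with the same label, starting point, and ending point.
    correct = sum((Counter(candidate) & Counter(gold)).values())
    recall = correct / len(gold) if gold else 0.0
    precision = correct / len(candidate) if candidate else 0.0

    def crosses(a, b):
        # Two spans cross if they overlap but neither contains the other.
        (s1, e1), (s2, e2) = a, b
        return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

    gold_spans = {(s, e) for _, s, e in gold}
    crossing = sum(
        1 for _, s, e in candidate
        if any(crosses((s, e), g) for g in gold_spans)
    )
    return recall, precision, crossing

# Treebank bracketing ((A B) C) versus candidate bracketing (A (B C)).
gold = [("X", 0, 3), ("Y", 0, 2)]
candidate = [("X", 0, 3), ("Y", 1, 3)]
scores = parseval(candidate, gold)
```

Here only the top-level constituent matches, so recall and precision are
both one half, and the candidate constituent spanning B C crosses the
treebank constituent spanning A B.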


Using  a portion  of  the  Wall  Street  Journal  treebank  as  the  test 
set,  parsers  such  as  Charniak  (1997)  and  Collins  (1999)  achieve  just 
under  90%  recall,  just  under  90%  precision,  and  about  1%  cross- 
bracketed  constituents  per  sentence. 

For comparing parsers which use different grammars, the PARSEVAL
metric includes a canonicalization algorithm for removing
information likely to be grammar-specific (auxiliaries, pre-infinitival
“to”, etc.) and computing a simplified score. The interested reader
should see Black et al. (1991). There are also related evaluation
metrics for dependency parses (Collins et al., 1999) and dependency-
based metrics which work for any parse structure (Lin, 1995; Carroll
et al., 1998).

For  grammar-checking,  we  can  compute  instead  the  precision 
and  recall  of  a simpler  task:  how  often  the  parser  correctly  rejected 
an  ungrammatical  sentence  (or  recognized  a grammatical  sentence). 
used by David Hays. But this lexical dependency notion of grammar is in
fact older than the relatively recent phrase-structure or constituency
grammars, and has its roots in the ancient Greek and Indian linguistic traditions.
Indeed the notion in traditional grammar of ‘parsing a sentence into subject
and predicate’ is based on lexical relations rather than constituent relations.


Figure  12.7  shows  an  example  parse  of  the  sentence  I gave  him  my  ad- 
dress, using  the  dependency  grammar  formalism  of  Jarvinen  and  Tapanainen 
(1997)  and  Karlsson  et  al.  (1995).  Note  that  there  are  no  non-terminal  or 
phrasal  nodes;  each  link  in  the  parse  tree  holds  between  two  lexical  nodes 
(augmented  with  the  special  <ROOT>  node).  The  links  are  drawn  from 
a fixed  inventory  of  around  35  relations,  most  of  which  roughly  represent 
grammatical  functions  or  very  general  semantic  relations.  Other  dependency- 
based  computational  grammars,  such  as  Link  Grammar  (Sleator  and  Tem- 
perley,  1993),  use  different  but  roughly  overlapping  links.  The  following 
table  shows  a few  of  the  relations  used  in  Jarvinen  and  Tapanainen  (1997): 

Dependency  Description
subj        syntactic subject
obj         direct object (incl. sentential complements)
dat         indirect object
pcomp       complement of a preposition
comp        predicate nominals (complements of copulas)
tmp         temporal adverbials
loc         location adverbials
attr        premodifying (attributive) nominals (genitives, etc.)
mod         nominal postmodifiers (prepositional phrases, adjectives)
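
To make the idea of a parse without phrasal nodes concrete, the following
Python sketch stores a dependency parse of I gave him my address as a list
of (head, relation, dependent) links and checks that it forms a tree. The
particular links are our reading of the relations in the table above, the
label main for the link from <ROOT> is an assumed name, and using word
forms as node identifiers only works because no word repeats in this
sentence.

```python
ROOT = "<ROOT>"

# (head, relation, dependent) links; the "main" label is an assumption.
PARSE = [
    (ROOT, "main", "gave"),
    ("gave", "subj", "I"),
    ("gave", "dat", "him"),
    ("gave", "obj", "address"),
    ("address", "attr", "my"),
]

def heads(parse):
    """Map each dependent to its (head, relation); reject a second head."""
    result = {}
    for head, rel, dep in parse:
        if dep in result:
            raise ValueError(f"{dep!r} has more than one head")
        result[dep] = (head, rel)
    return result

def is_tree(parse, words):
    """True if every word has exactly one head and a path up to ROOT."""
    try:
        h = heads(parse)
    except ValueError:
        return False
    if set(h) != set(words):
        return False
    for w in words:
        seen = set()
        while w != ROOT:
            if w in seen or w not in h:   # cycle, or head is not a word
                return False
            seen.add(w)
            w = h[w][0]
    return True

words = "I gave him my address".split()
ok = is_tree(PARSE, words)
```

Note that the structure mentions only words and labeled links between
them; nothing in it records a constituent such as NP or VP.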


We have already discussed why dependency information is important.
Is  there  any  advantage  to  using  only  dependency  information  and  ignoring 
constituency?  Dependency  grammar  researchers  argue  that  one  of  the  main 
advantages  of  pure  dependency  grammars  is  their  ability  to  handle  languages 
with  relatively  free  word  order.  For  example  the  word  order  in  languages 
like  Czech  is  much  more  flexible  than  in  English;  an  object  might  occur  be- 
fore or  after  a location  adverbial  or  a comp.  A phrase-structure  grammar 
would  need  a separate  rule  for  each  possible  place  in  the  parse  tree  that  such 
an  adverbial  phrase  could  occur.  A dependency  grammar  would  just  have 
one  link-type  representing  this  particular  adverbial  relation.  Thus  a depen- 
dency grammar  abstracts  away  from  word-order  variation,  representing  only 
the  information  that  is  necessary  for  the  parse. 

There are a number of computational implementations of dependency
grammars; Link Grammar (Sleator and Temperley, 1993) and Constraint
Grammar (Karlsson et al., 1995) are easily-available broad-coverage
dependency grammars and parsers for English. Dependency grammars are also
often used for other languages. Hajic (1998), for example, describes the
500,000 word Prague Dependency Treebank for Czech which has been used
to train probabilistic dependency parsers (Collins et al., 1999).

Categorial  Grammar 

Categorial grammars were first proposed by Ajdukiewicz (1935), and
modified by Bar-Hillel (1953), Lambek (1958), Dowty (1979), Ades and
Steedman (1982), and Steedman (1989) inter alia. See Bach (1988) for an
introduction and the other papers in Oehrle et al. (1988) for a survey of
extensions to the basic models. We will describe a simplified version of the
combinatory categorial grammar of Steedman (1989). A categorial grammar has
two components. The categorial lexicon associates each word with a
syntactic and semantic category. The combinatory rules allow functions and
arguments to be combined. There are two types of categories: functors and
arguments. Arguments, like nouns, have simple categories like N. Verbs
or determiners act more like functors. For example, a determiner can be
thought of as a function which applies to an N on its right to produce an NP.
Such complex categories are built using the X/Y and X\Y operators. X/Y
means a function from Y to X, i.e. something which combines with a Y
on its right to produce an X. Determiners thus receive the category NP/N:
something which combines with an N on its right to produce an NP. Similarly,
transitive verbs might have the category VP/NP: something which combines
with an NP on the right to produce a VP. Ditransitive verbs like give might
have the category (VP/NP)/NP: something which combines with an NP on its
right to yield a transitive verb. The simplest combination rules just combine
an X/Y with a Y on its right to produce an X, or an X\Y with a Y on its left
to produce an X.

Consider the simple sentence Harry eats apples from Steedman (1989).
Instead of using a primitive VP category, let’s assume that a finite verb phrase
like eats apples has the category (S\NP): something which combines with an
NP on the left to produce a sentence. Harry and apples are both NPs. Eats is
a finite transitive verb which combines with an NP on the right to produce a
finite VP: (S\NP)/NP. The derivation of S proceeds as follows:


(12.26)   Harry      eats       apples
           NP     (S\NP)/NP       NP
                  --------------------
                          S\NP
          ----------------------------
                       S

Modern  categorial  grammars  include  more  complex  combinatory  rules 
which are needed for coordination and other complex phenomena, and also
include  composition  of  semantic  categories  as  well  as  syntactic  ones.  See 
Chapter  15  for  a discussion  of  semantic  composition,  and  the  above-mentioned 
references  for  more  details  about  categorial  grammar. 


12.5  Human  Parsing 

How do people parse? Do we have evidence that people use any of the
models of grammar and parsing developed over the last four chapters? Do
people use probabilities to parse? The study of human parsing (often called
human sentence processing) is a relatively new one, and we don’t yet have
complete answers to these questions. But in the last 20 years we have learned
a lot about human parsing; this section will give a brief overview of some
recent theories. These results are relatively recent, however, and there is still
disagreement over the correct way to model human parsing, so the reader
should take some of this with a grain of salt.

An  important  component  of  human  parsing  is  ambiguity  resolution. 
How  can  we  find  out  how  people  choose  between  two  ambiguous  parses  of 
a sentence?  As  was  pointed  out  in  this  chapter  and  in  Chapter  9,  while  al- 
most every  sentence  is  ambiguous  in  some  way,  people  rarely  notice  these 
ambiguities.  Instead,  they  only  seem  to  see  one  interpretation  for  a sentence. 



Chapter  12.  Lexicalized  and  Probabilistic  Parsing 


Following a suggestion by Fodor (1978), Ford et al. (1982) used this fact
to show that the human sentence processor is sensitive to lexical
subcategorization preferences. They presented subjects with ambiguous sentences
like (12.27-12.28), in which the prepositional phrase on the beach could
attach either to a noun phrase (the dogs) or a verb phrase. They asked the
subjects to read the sentence and check off a box indicating which of the two
interpretations they got first. The results are shown after each sentence:

(12.27)  The  women  kept  the  dogs  on  the  beach 

• The  women  kept  the  dogs  which  were  on  the  beach.  5% 

• The  women  kept  them  (the  dogs)  on  the  beach.  95% 

(12.28)  The  women  discussed  the  dogs  on  the  beach 

• The  women  discussed  the  dogs  which  were  on  the  beach.  90% 

• The  women  discussed  them  (the  dogs)  while  on  the  beach.  10% 

The results were that subjects preferred VP-attachment with keep and
NP-attachment with discuss. This suggests that keep has a subcategorization
preference for a VP with three constituents (VP → V NP PP), while discuss
has a subcategorization preference for a VP with two constituents (VP →
V NP), although both verbs still allow both subcategorizations.

Much of the more recent ambiguity-resolution research relies on a
specific class of temporarily ambiguous sentences called garden-path
sentences. These sentences, first described by Bever (1970), are sentences
which are cleverly constructed to have three properties which combine to
make them very difficult for people to parse:

1.  they  are  temporarily  ambiguous:  the  sentence  is  unambiguous,  but 
its  initial  portion  is  ambiguous. 

2.  one  of  these  two  parses  in  the  initial  portion  is  somehow  preferable  to 
the  human  parsing  mechanism. 

3.  but  the  dispreferred  parse  is  the  correct  one  for  the  sentence. 

The result of these three properties is that people are ‘led down the
garden path’ towards the incorrect parse, and then are confused when they
realize it’s the wrong one. Sometimes this confusion is quite conscious, as in
Bever’s example (12.29); in fact this sentence is so hard to parse that readers
often need to be shown the correct structure. In the correct structure raced
is part of a reduced relative clause modifying The horse, and means ‘The
horse [which was raced past the barn] fell’; this structure is also present in
the sentence ‘Students taught by the Berlitz method do better when they get
to France’.

(12.29)  The horse raced past the barn fell.

[Figure: two candidate parse trees, (a) and (b), omitted.]

(12.30)  The complex houses married and single students and their families.

[Figure: two candidate parses of The complex houses omitted; in (a) the words are tagged Det Adj N, in the other houses is a V.]

(12.31)  The student forgot the solution was in the back of the book.

[Figure: two candidate parses of the initial portion, up to the word was, omitted.]


Other times the confusion caused by a garden-path sentence is so subtle
that it can only be measured by a slight increase in reading time. For
example in (12.31) from Trueswell et al. (1993) (modified from an experiment by
Ferreira and Henderson (1991)), readers often mis-parse the solution as the
direct object of forgot rather than as the subject of an embedded sentence.
This is another subcategorization preference difference; forgot prefers a
direct object (VP → V NP) to a sentential complement (VP → V S). But the
difference is subtle, and is only noticeable because the subjects spent
significantly more time reading the word was. How do we know how long a
subject takes to read a word or a phrase? One way is by scrolling a sentence
onto a computer screen one word or phrase at a time; another is by using
an eye-tracker to track how long their eyes linger on each word. Trueswell
et al. (1993) employed both methods in separate experiments. This
‘mini-garden-path’ effect at the word was suggests that subjects had chosen the
direct object parse and had to re-analyze or rearrange their parse once
they realized they were in a sentential complement. By contrast, a verb which
prefers a sentential complement (like hope) didn’t cause extra reading time
at was.

Garden-path sentences are not restricted to English. (12.32)
shows a Spanish example from Gilboy and Sopena (1996) in which the word
que, just like English that, is ambiguous between the relative clause marker
and the sentential complement marker. Thus up to the phrase dos hijas,
readers assume the sentence means “the man told the woman that he had
two daughters”; after reading the second que, they must reparse que tenía
dos hijas as a relative clause modifier of mujer rather than a complement of
dijo.

(12.32)  El  hombre le  dijo a  la  mujer que  tenía dos hijas
         The man    her told to the woman that had   two daughters

         que  la  invitaba   a  cenar.
         that her he invited to dinner.

         ‘The man told the woman who had two daughters that (he) would
         invite her for dinner.’

Example (12.33) shows a Japanese garden path from Mazuka and Itoh
(1995). In this sentence, up to the verb mikaketa (saw), the reader assumes
the sentence means “Yoko saw the child at the intersection.” But upon
reading the word takusii-ni (taxi-DAT), they have to reanalyze child not as
the object of saw, but as the object of put-on.

(12.33)  Yoko-ga  kodomo-o  koosaten-de  mikaketa  takusii-ni  noseta. 

Yoko-NOM  child-ACC  intersection-LOC  saw  taxi-DAT  put  on 

‘Yoko  made  the  child  ride  the  taxi  she  saw  at  the  intersection.’ 

In the Spanish and Japanese examples, and in examples (12.29) and
(12.31), the garden path is caused by the subcategorization preferences of
the verbs. The garden-path and other methodologies have been employed to
study many kinds of preferences besides subcategorization preferences.
Example (12.30) from Jurafsky (1996) shows that sometimes these preferences
have to do with part-of-speech preferences (for example whether houses is
more likely to be a verb or a noun). Many of these preferences have been
shown to be probabilistic and to be related to the kinds of probabilities we
have been describing in this chapter. MacDonald (1993) showed that the
human processor is sensitive to whether a noun is more likely to be a head
or a non-head of a constituent, and also to word-word collocation
frequencies. Mitchell et al. (1995) showed that syntactic phrase-structure
frequencies (such as the frequency of the relative clause construction) play a role
in human processing. Juliano and Tanenhaus (1993) showed that the
human processor is sensitive to a combination of lexical and phrase-structure
frequency.

Besides grammatical knowledge, human parsing is affected by many
other factors which we will describe later, including resource constraints
(such as memory limitations, to be discussed in Chapter 13), thematic
structure (such as whether a verb expects semantic agents or patients, to be
discussed in Chapter 16) and semantic, discourse, and other contextual
constraints (to be discussed in Chapter 15 and Chapter 18). While there is
general agreement about the knowledge sources used by the human sentence
processor, there is less agreement about the time course of knowledge use.
Frazier and colleagues (most recently in Frazier and Clifton, 1996) argue
that an initial interpretation is built using purely syntactic knowledge, and
that semantic, thematic, and discourse knowledge only becomes available
later. This view is often called a modularist perspective; researchers
holding this position generally argue that human syntactic knowledge is a distinct
module of the human mind. Many other researchers (including MacDonald,
1994; MacWhinney, 1987; Pearlmutter and MacDonald, 1992; Tabor et al.,
1997; Trueswell and Tanenhaus, 1994; Trueswell et al., 1994) hold an
interactionist perspective, arguing that people use multiple kinds of information
incrementally. For this latter group, human parsing is an interactive process,
in which different knowledge sources interactively constrain the process of
interpretation.

Researchers such as MacDonald (1993) argue that these constraints are
fundamentally probabilistic. For example Jurafsky (1996) and Narayanan
and Jurafsky (1998) showed that a probabilistic model which included PCFG
probabilities as well as syntactic and thematic subcategorization
probabilities could account for garden-path examples such as those in (12.29-12.31)
above. For example P(N → houses) is greater than P(V → houses), and this is
one of the factors accounting for the processing difficulty of example (12.30)
above. In the Jurafsky and Narayanan-Jurafsky model, the human language
processor takes an input sentence, and computes the most-likely
interpretation by relying on probabilistic sources of linguistic information. Errors
(such as garden-path sentences) are caused by two factors. First, the stored
probabilities may simply not match the intended interpretation of the speaker
(i.e. people may just rank the wrong interpretation as the best one). Second,
people are unwilling or unable to maintain very many interpretations at one
time. Whether because of memory limitations, or just because they have a
strong desire to come up with a single interpretation, they prune away
low-ranking interpretations. Jurafsky and Narayanan-Jurafsky suggest that this
pruning happens via probabilistic beam search in the human parser (like the
pruning described in Chapter 7). The result is that they prune away the
correct interpretation, leaving the highest-scoring but incorrect one.
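
The pruning step can be illustrated with a small Python sketch. It is a
simplification made for exposition, not the published model: the function
beam_prune, the beam ratio of 5, and the two probabilities are all
hypothetical values chosen for the example. An interpretation survives only
if the best interpretation is no more than beam_ratio times as probable.

```python
def beam_prune(interpretations, beam_ratio):
    """Keep interpretations whose probability is within beam_ratio of the best.

    interpretations maps a label to a probability. An interpretation is
    pruned when best / p exceeds beam_ratio (or when p is zero).
    """
    best = max(interpretations.values())
    return {
        name: p
        for name, p in interpretations.items()
        if p > 0 and best / p <= beam_ratio
    }

# Hypothetical probabilities after reading "The horse raced ..."
candidates = {"main-verb": 0.92, "reduced-relative": 0.01}
survivors = beam_prune(candidates, beam_ratio=5.0)
```

With these made-up numbers the reduced-relative reading falls outside the
beam and is discarded, so when fell arrives the only surviving
interpretation is the incorrect main-verb one, which is the garden-path
effect described above.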


12.6  Summary 

This  chapter  has  sketched  the  basics  of  probabilistic  parsing,  concentrat- 
ing on  probabilistic  context-free  grammars  and  probabilistic  lexicalized 
grammars. 

• Probabilistic grammars assign a probability to a sentence or string of
words, while attempting to capture more sophisticated syntactic
information than the N-gram grammars of Chapter 6.

• A probabilistic context-free grammar (PCFG) is a context-free
grammar in which every rule is annotated with the probability of choosing
that rule. Each PCFG rule is treated as if it were conditionally
independent; thus the probability of a sentence is computed by
multiplying the probabilities of each rule in the parse of the sentence.

• The Cocke-Younger-Kasami (CYK) algorithm is a bottom-up
dynamic programming parsing algorithm. Both the CYK and Earley
algorithms can be augmented to compute the probability of a parse while
they are parsing a sentence.

• PCFG probabilities can be learned by counting in a parsed corpus, or
by parsing a corpus. The Inside-Outside algorithm is a way of dealing
with the fact that the sentences being parsed are ambiguous.

• Probabilistic lexicalized CFGs augment PCFGs with a lexical head
for each rule. The probability of a rule can then be conditioned on the
lexical head or nearby heads.

• Parsers are evaluated using three metrics: labeled recall, labeled
precision, and cross-brackets.

• There is evidence based on garden-path sentences and other on-line
sentence-processing experiments that the human parser operates
probabilistically and uses probabilistic grammatical knowledge such as
subcategorization information.
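
As a reminder of what the independence assumption in the second point
amounts to, here is a small Python sketch that multiplies rule probabilities
over a parse tree. The grammar and its probabilities are hypothetical, made
up for the example, and trees are written as nested tuples of the form
(label, child, ...), which is simply a convenient encoding.

```python
from math import prod

# Hypothetical rule probabilities; the expansions of each left-hand side
# sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Harry",)): 0.5,
    ("NP", ("apples",)): 0.5,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("eats",)): 1.0,
}

def rules_used(tree):
    """Yield the (lhs, rhs) rule at every node of a nested-tuple tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules_used(child)

def parse_probability(tree):
    """Probability of a parse: the product of its rule probabilities."""
    return prod(RULE_PROB[rule] for rule in rules_used(tree))

TREE = ("S", ("NP", "Harry"), ("VP", ("V", "eats"), ("NP", "apples")))
p = parse_probability(TREE)
```

The parse uses five rules, two of which have probability 0.5 in this toy
grammar, so the product is 0.25.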

Bibliographical and Historical Notes


Many of the formal properties of probabilistic context-free grammars were
first worked out by Booth (1969) and Salomaa (1969). Baker (1979)
proposed the Inside-Outside algorithm for unsupervised training of PCFG
probabilities, which used a CYK-style parsing algorithm to compute inside
probabilities. Jelinek and Lafferty (1991) extended the CYK algorithm to
compute probabilities for prefixes. Stolcke (1995) drew on both these algorithms
to adapt the Earley algorithm to PCFGs.

A number  of  researchers  starting  in  the  early  1990’s  worked  on  adding 
lexical  dependencies  to  PCFGs,  and  on  making  PCFG  probabilities  more 
sensitive  to  surrounding  syntactic  structure.  Many  of  these  papers  were  first 
presented  at  the  DARPA  Speech  and  Natural  Language  Workshop  in  June, 
1990.  A paper  by  Hindle  and  Rooth  (1990)  applied  lexical  dependencies 
to  the  problem  of  attaching  preposition  phrases;  in  the  question  session  to 
a later  paper  Ken  Church  suggested  applying  this  method  to  full  parsing 
(Marcus,  1990).  Early  work  on  such  probabilistic  CFG  parsing  augmented 
with  probabilistic  dependency  information  includes  Magerman  and  Marcus 
(1991), Black et al. (1992), Jones and Eisner (1992), Bod (1993), and Jelinek
et  al.  (1994),  in  addition  to  Collins  (1996),  Charniak  (1997),  and  Collins 
(1999)  discussed  above. 

Probabilistic formulations of grammar other than PCFGs include
probabilistic TAG grammar (Resnik, 1992; Schabes, 1992), based on the TAG
grammars discussed in Chapter 9, probabilistic LR parsing (Briscoe and
Carroll, 1993), and probabilistic link grammar (Lafferty et al., 1992). An
approach to probabilistic parsing called supertagging extends the part-of-
speech tagging metaphor to parsing by using very complex tags that are in
fact fragments of lexicalized parse trees (Bangalore and Joshi, 1999; Joshi
and Srinivas, 1994), based on the lexicalized TAG grammars of Schabes
et al. (1988). For example the noun purchase would have a different tag
as the first noun in a noun compound (where it might be on the left of a
small tree dominated by Nominal) than as the second noun (where it might
be on the right). See Goodman (1997) and Abney (1997) for
probabilistic treatments of feature-based grammars. Another approach combines the
finite-state model of parsing described in Chapter 9 with the N-gram, by
doing partial parsing and then computing N-grams over basic phrases (e.g.
P(PP|NP)) (Moore et al., 1995; Zechner and Waibel, 1998). A number
of probabilistic parsers are based on dependency grammars; see for
example Chelba et al. (1997), Chelba and Jelinek (1998), and Berger and Printz
(1998); these parsers were also used as language models for speech
recognition.

Related  to  probabilistic  dependency  grammars  is  the  idea  of  learning 
subcategorization  frames  for  verbs,  as  well  as  probabilities  for  these  frames. 
Algorithms  which  learn  non-probabilistic  subcategorization  frames  for  verbs 
include the cue-based approach of Brent (1993) and the finite-state
automaton approach of Manning (1993). Briscoe and Carroll (1997) extract more
complex subcategorization frames (using 160 possible subcategorization
labels) and also learn subcategorization frame frequencies, using a
probabilistic LR parser and some post-processing. Roland and Jurafsky (1998) showed
that  it  is  important  to  compute  subcategorization  probabilities  for  the  word 
sense  (‘lemma’)  rather  than  the  simple  orthographic  word. 

Many probabilistic and corpus-based approaches have been taken to
the preposition-phrase attachment problem since Hindle and Rooth's study,
including TBL (Brill and Resnik, 1994), Maximum Entropy (Ratnaparkhi
et al., 1994), Memory-Based Learning (Jakub and Daelemans, 1997),
log-linear models (Franz, 1997), and decision trees using semantic distance
between heads (computed from WordNet) (Stetina and Nagao, 1997), as well
as the use of machine learning techniques like boosting (Abney et al., 1999).

Manning and Schütze (1999) is a good advanced textbook on
statistical natural language processing which covers probabilistic parsing. Collins’
(1999) dissertation includes a very readable survey of the field and
introduction to his parser.


Exercises 

12.1  Implement  the  CYK  algorithm. 

12.2  Sketch  out  how  the  CYK  algorithm  would  have  to  be  augmented  to 
handle  lexicalized  probabilities. 


12.3  Implement  your  lexicalized  extension  of  the  CYK  algorithm. 

12.4  Implement  your  lexicalized  extension  of  the  CYK  algorithm.

12.5  Implement the PARSEVAL metrics described on page 460. Next
either use a treebank or create your own hand-checked parsed test set. Now use
your CFG (or other) parser and grammar to parse the test set and compute
labeled recall, labeled precision, and cross-brackets.

12.6  Take  any  three  sentences  from  Chapter  9 and  hand-parse  them  into 
the  dependency  grammar  formalism  of  Karlsson  et  al.  (1995)  shown  on  page 
461. 


LANGUAGE AND COMPLEXITY


This  is  the  dog,  that  worried  the  cat,  that  killed  the  rat,  that  ate 
the  malt,  that  lay  in  the  house  that  Jack  built. 

Mother  Goose,  The  house  that  Jack  built 


This  is  the  malt  that  the  rat  that  the  cat  that  the  dog  worried 
killed  ate. 

Victor  H.  Yngve  (1960) 


Much  of  the  humor  in  musical  comedy  and  comic  operetta  comes  from 
entwining  the  main  characters  in  fabulously  complicated  plot  twists.  Casilda, 
the daughter of the Duke of Plaza-Toro in Gilbert and Sullivan’s The
Gondoliers, is in love with her father’s attendant Luiz. Unfortunately, Casilda
discovers  she  has  already  been  married  (by  proxy)  as  a babe  of  six  months  to 
“the  infant  son  and  heir  of  His  Majesty  the  immeasurably  wealthy  King  of 
Barataria”.  It  is  revealed  that  this  infant  son  was  spirited  away  by  the  Grand 
Inquisitor  and  raised  by  a “highly  respectable  gondolier”  in  Venice  as  a gon- 
dolier. The  gondolier  had  a baby  of  the  same  age  and  could  never  remember 
which  child  was  which,  and  so  Casilda  was  in  the  unenviable  position,  as 
she  puts  it,  of  “being  married  to  one  of  two  gondoliers,  but  it  is  impossible 
to  say  which”.  By  way  of  consolation,  the  Grand  Inquisitor  informs  her  that 
“such  complications  frequently  occur”. 

Luckily, such complications don’t frequently occur in natural language.
Or do they? In fact there are sentences that are so complex that they are hard
to understand, such as Yngve’s sentence above, or the sentence:

“The Republicans who the senator who she voted for chastised
were trying to cut all benefits for veterans.”

Studying such sentences, and more generally understanding what level of
complexity tends to occur in natural language, is an important area of
language processing. Complexity plays an important role, for example, in
deciding when we need to use a particular formal mechanism. Formal
mechanisms like finite automata, Markov models, transducers, phonological rewrite
rules, and context-free grammars can be described in terms of their power,
or equivalently in terms of the complexity of the phenomena that they can
describe. This chapter introduces the Chomsky hierarchy, a theoretical tool
that allows us to compare the expressive power or complexity of these
different formal mechanisms. With this tool in hand, we summarize arguments
about the correct formal power of the syntax of natural languages, in
particular English but also including a famous Swiss dialect of German that has the
interesting syntactic property called cross-serial dependencies. This
property has been used to argue that context-free grammars are insufficiently
powerful to model the morphology and syntax of natural language.

In addition to using complexity as a metric for understanding the
relation between natural language and formal models, the field of complexity
is also concerned with what makes individual constructions or sentences hard
to understand. For example we saw above that certain nested or
center-embedded sentences are difficult for people to process. Understanding
what makes some sentences difficult for people to process is an important
part of understanding human parsing.

13.1  The  Chomsky  Hierarchy 

How are automata, context-free grammars, and phonological rewrite rules
related? What they have in common is that each describes a formal
language, which we have seen is a set of strings over a finite alphabet. But
the kinds of grammars we can write with each of these formalisms are of
different generative power. One grammar is of greater generative power or
complexity than another if it can define a language that the other cannot
define. We will show, for example, that a context-free grammar can be used to
describe formal languages that cannot be described with a finite state
automaton.

It is possible to construct a hierarchy of grammars, where the set of
languages describable by grammars of greater power subsumes the set of
languages describable by grammars of lesser power. There are many possible
such hierarchies; the one that is most commonly used in computational
linguistics is the Chomsky hierarchy (Chomsky, 1959a), which includes
four kinds of grammars, characterized graphically in Figure 13.1.


[Figure 13.1: A Venn diagram of the languages on the Chomsky Hierarchy.
Nested sets, from outermost to innermost: Type 0 Languages;
Context-Sensitive Languages; Context-Free Languages (with no epsilon
productions); Regular (or Right Linear) Languages.]


What is perhaps not intuitively obvious is that the decrease in the
generative power of languages from the most powerful to the weakest can be
accomplished merely by placing constraints on the way the grammar rules
are allowed to be written. The following table shows the four types of
grammars in the Chomsky hierarchy, defined by the constraints on the form that
rules must take. In these examples, A is a single non-terminal, and α, β, and
γ are arbitrary strings of terminal and non-terminal symbols. They may be
empty unless this is specifically disallowed below. x is an arbitrary string
of terminal symbols.


Type  Common Name        Rule Skeleton            Linguistic Example
0     Turing Equivalent  α → β, s.t. α ≠ ε        ATNs
1     Context Sensitive  αAβ → αγβ, s.t. γ ≠ ε    Tree-Adjoining Grammars
2     Context Free       A → γ                    Phrase Structure Grammars
3     Regular            A → xB or A → x          Finite State Automata

Figure 13.2  The Chomsky Hierarchy

Type 0 or unrestricted grammars have no restrictions on the form
of their rules, except that the left-hand side cannot be the empty string ε.
Any (non-null) string can be written as any other string (or as ε). Type 0
grammars characterize the recursively enumerable languages, i.e., those
whose strings can be listed (enumerated) by a Turing Machine.

Context-sensitive grammars have rules that rewrite a non-terminal
symbol A in the context αAβ as any non-empty string of symbols. They can
be either written in the form αAβ → αγβ or in the form A → γ / α __ β. We
have seen this latter version in the Chomsky-Halle representation of
phonological rules (Chomsky and Halle, 1968), as the following rule of
Flapping demonstrates:

/t/ → [dx] / V __ V

While the form of these rules seems context-sensitive, Chapter 4 showed
that phonological rule systems that do not have recursion are actually
equivalent in power to the regular grammars. A linguistic model that is known
to be context-sensitive is the Tree-Adjoining Grammar (Joshi, 1985).

Another way of conceptualizing a rule in a context-sensitive grammar
is as rewriting a string of symbols δ as another string of symbols φ in a
“non-decreasing” way, such that φ has at least as many symbols as δ.

We studied context-free grammars in Chapter 9. Context-free rules
allow any single nonterminal to be rewritten as any string of terminals and
nonterminals. A nonterminal may also be rewritten as ε, although we didn’t
make use of this option in Chapter 9.

Regular grammars are equivalent to regular expressions. That is, a
given regular language can be characterized either by a regular expression
of the type we discussed in Chapter 2, or by a regular grammar. Regular
grammars can either be right-linear or left-linear. A rule in a right-linear
grammar has a single non-terminal on the left, and at most one non-terminal
on the right-hand side. If there is a non-terminal on the right-hand side,
it must be the last symbol in the string. The right-hand side of left-linear
grammars is reversed (the right-hand side must start with (at most) a single
non-terminal). All regular languages have both a left-linear and a
right-linear grammar. For the rest of our discussion, we will consider only
the right-linear grammars.
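The definition of a right-linear rule can be turned into a small mechanical check. The following Python sketch is only an illustration under assumptions that are ours, not the text's: rules are written as (left-hand side, right-hand side) pairs of strings, uppercase letters stand for non-terminals, lowercase letters for terminals, and the function name is hypothetical.

```python
def is_right_linear(rules):
    """Check that every rule has the form A -> xB or A -> x.

    Assumed encoding: uppercase letters are non-terminals, lowercase
    letters are terminals, and "" is the empty right-hand side.
    """
    for lhs, rhs in rules:
        # The left-hand side must be exactly one non-terminal.
        if not (len(lhs) == 1 and lhs.isupper()):
            return False
        positions = [i for i, ch in enumerate(rhs) if ch.isupper()]
        # At most one non-terminal is allowed on the right-hand side...
        if len(positions) > 1:
            return False
        # ...and if present it must be the last symbol.
        if positions and positions[0] != len(rhs) - 1:
            return False
    return True


REGULAR_RULES = [("S", "aA"), ("S", "bB"), ("A", "aS"), ("B", "bbS"), ("S", "")]
CENTER_RULES = [("S", "aSb"), ("S", "")]
```

Under this encoding `is_right_linear(REGULAR_RULES)` holds, while `CENTER_RULES` fails the check because its non-terminal sits in the middle of the right-hand side.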

For example, consider the following regular (right-linear) grammar:

S → aA
S → bB
A → aS
B → bbS
S → ε


It is regular, since the left-hand side of each rule is a single non-terminal
and each right-hand side has at most one (rightmost) non-terminal. Here is a
sample derivation in the language:

S ⇒ aA ⇒ aaS ⇒ aabB ⇒ aabbbS ⇒ aabbbaA ⇒ aabbbaaS ⇒ aabbbaa

We can see that each time S expands, it produces either aaS or bbbS;
thus the reader should convince themself that this language corresponds to
the regular expression (aa ∪ bbb)*.
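One informal way to convince oneself of this correspondence is to derive many random strings from the grammar and test each against the regular expression. The Python sketch below does that; the rule encoding, the function names, and the step budget are our own assumptions, and passing the check is evidence rather than a proof.

```python
import random
import re

# The right-linear grammar from the text. Each right-hand side is stored as
# (terminal string, next non-terminal); None means the derivation stops.
RULES = {
    "S": [("a", "A"), ("b", "B"), ("", None)],
    "A": [("a", "S")],
    "B": [("bb", "S")],
}

# Python spelling of the regular expression (aa U bbb)*.
PATTERN = re.compile(r"(aa|bbb)*")


def derive(rng, max_steps=40):
    """Randomly derive one string; return None if the step budget runs out."""
    output, symbol = "", "S"
    for _ in range(max_steps):
        terminals, symbol = rng.choice(RULES[symbol])
        output += terminals
        if symbol is None:
            return output
    return None


def sample_strings(count, seed=0):
    """Collect the derivations that terminated within the step budget."""
    rng = random.Random(seed)
    samples = (derive(rng) for _ in range(count))
    return [s for s in samples if s is not None]
```

Every string returned by `sample_strings` should satisfy `PATTERN.fullmatch`, and the derived string aabbbaa from the sample derivation matches as well.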

We will not present the proof that a language is regular if and only if it
is generated by a regular grammar; it was first proved by Chomsky and Miller
(1958) and can be found in textbooks like Hopcroft and Ullman (1979) and
Lewis and Papadimitriou (1981). The intuition is that since the nonterminals
are always at the right or left edge of a rule, they can be processed
iteratively rather than recursively.


13.2  How to Tell if a Language Isn’t Regular

How do we know which type of rules to use for a given problem? Could we
use regular expressions to write a grammar for English? Or do we need to
use context-free rules or even context-sensitive rules? It turns out that for
formal languages there are methods for deciding this. That is, we can say for
a given formal language whether it is representable by a regular expression,
or whether it instead requires a context-free grammar, and so on.

So if we want to know if some part of natural language (the phonology
of English, let’s say, or perhaps the morphology of Turkish) is
representable by a certain class of grammars, we need to find a formal
language that models the relevant phenomena and figure out which class of
grammars is appropriate for this formal language.

Why should we care whether (say) the syntax of English is
representable by a regular language? One main reason is that we’d like to know
which type of rule to use in writing computational grammars for English.
If English is regular, we would write regular expressions, and use efficient
automata to process the rules. If English is context-free, we would write
context-free rules and use the Earley algorithm to parse sentences, and so
on.

Another reason to care is that it tells us something about the formal
properties of different aspects of natural language; it would be nice to know
where a language ‘keeps’ its complexity; whether the phonological system
of a language is simpler than the syntactic system, or whether a certain
kind of morphological system is inherently simpler than another kind. It
would be a strong and exciting claim, for example, if we could show that the
phonology of English was capturable by a finite-state machine rather than
the context-sensitive rules that are traditionally used; it would mean that
English phonology has quite simple formal properties. Indeed, this fact was
shown by Johnson (1972), and helped lead to the modern work in finite-state
methods shown in Chapter 3 and Chapter 4.

The  Pumping  Lemma 

The most common way to prove that a language is regular is to actually
build a regular expression for the language. In doing this we can rely on
the fact that the regular languages are closed under union, concatenation,
Kleene star, complementation, and intersection. We saw examples of union,
concatenation, and Kleene star in Chapter 2. So if we can independently
build a regular expression for two distinct parts of a language, we can use the
union operator to build a regular expression for the whole language, proving
that the language is regular.
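Closure under intersection is perhaps the least obvious of these properties when one thinks only in terms of regular expressions. It is easier to see with automata, via the standard product construction. The Python sketch below is a minimal illustration; the dictionary encoding of a deterministic automaton and the two example machines (an even number of a's; strings ending in b) are our own assumptions.

```python
from itertools import product


def intersect(d1, d2, alphabet):
    """Product construction: a DFA accepting exactly L(d1) ∩ L(d2)."""
    states = set(product(d1["states"], d2["states"]))
    # Each product state tracks where both machines are at once.
    delta = {
        ((p, q), c): (d1["delta"][(p, c)], d2["delta"][(q, c)])
        for (p, q) in states
        for c in alphabet
    }
    finals = {(p, q) for (p, q) in states
              if p in d1["finals"] and q in d2["finals"]}
    return {"states": states, "start": (d1["start"], d2["start"]),
            "finals": finals, "delta": delta}


def accepts(dfa, string):
    state = dfa["start"]
    for symbol in string:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["finals"]


# Hypothetical example machines over the alphabet {a, b}.
EVEN_A = {"states": {0, 1}, "start": 0, "finals": {0},
          "delta": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}}
ENDS_B = {"states": {0, 1}, "start": 0, "finals": {1},
          "delta": {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1}}
BOTH = intersect(EVEN_A, ENDS_B, "ab")
```

Since the product machine is itself a finite automaton, the intersection of the two languages is regular; the same idea underlies the intersection arguments used later in this chapter.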

Sometimes we want to prove that a given language is not regular. An
extremely useful tool for doing this is the Pumping Lemma. There are two
intuitions behind this lemma (our description of the pumping lemma draws
from Lewis and Papadimitriou (1981) and Hopcroft and Ullman (1979)).
First, if a language can be modeled by a finite automaton, we must be able
to decide with a bounded amount of memory whether any string was in the
language or not. This amount of memory can’t grow larger for different
strings (since a given automaton has a fixed number of states). Thus the
memory needs must not be proportional to the length of the input. This
means for example that languages like a^n b^n are not likely to be regular,
since we would need some way to remember what n was in order to make sure that
there were an equal number of a’s and b’s. The second intuition relies on the
fact that if a regular language has any long strings (longer than the number
of states in the automaton), there must be some sort of loop in the automaton
for the language. We can use this fact by showing that if a language doesn’t
have such a loop, then it can’t be regular.

Let’s consider a language L and the corresponding deterministic FSA
M, which has N states. Consider an input string also of length N. The
machine starts out in state q_0; after seeing 1 symbol it will be in state
q_1; after N symbols it will be in state q_N. In other words, a string of
length N will go through N + 1 states (from q_0 to q_N). But there are only
N states in the machine. This means that at least 2 of the states along the
accepting path (call them q_i and q_j) must be the same. In other words,
somewhere on an accepting path from the initial to final state, there must be
a loop. Figure 13.3 shows an illustration of this point. Let x be the string
of symbols that the machine reads on going from the initial state q_0 to the
beginning of the loop q_i. y is the string of symbols that the machine reads
in going through the loop. z is the string of symbols from the end of the
loop (q_j) to the final accepting state (q_N).


The machine accepts the concatenation of these three strings of
symbols, i.e. xyz. But if the machine accepts xyz it must accept xz! This is
because the machine could just skip the loop in processing xz. Furthermore,
the machine could also go around the loop any number of times; thus it must
also accept xyyz, xyyyz, xyyyyz, etc. In fact, it must accept any string of
the form x y^n z for n ≥ 0.
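The loop argument can be made concrete with a few lines of Python. The sketch below finds the first repeated state on the path of an accepted string, splits the string into x, y, and z at that point, and lets us check that pumped strings are still accepted. The three-state machine used here (strings over {a, b} that end in ab) and all function names are our own illustrative assumptions.

```python
def visited_states(delta, start, string):
    """Return the sequence of states the machine passes through."""
    states = [start]
    for symbol in string:
        states.append(delta[(states[-1], symbol)])
    return states


def split_at_loop(delta, start, string):
    """Split string into (x, y, z), where y is read around the first loop."""
    seen_at = {}
    for position, state in enumerate(visited_states(delta, start, string)):
        if state in seen_at:
            i = seen_at[state]
            return string[:i], string[i:position], string[position:]
        seen_at[state] = position
    raise ValueError("no state repeats; the string is too short")


def accepts(delta, start, finals, string):
    return visited_states(delta, start, string)[-1] in finals


# Hypothetical 3-state DFA accepting strings over {a, b} that end in "ab".
DELTA = {(0, "a"): 1, (0, "b"): 0,
         (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 1, (2, "b"): 0}
START, FINALS = 0, {2}
```

For the accepted string bab, whose length equals the number of states, `split_at_loop` yields x empty, y = b, and z = ab, and every string b^n ab is accepted, exactly as the argument predicts.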

The version of the pumping lemma we give is a simplified one for
infinite regular languages; stronger versions can be stated that also apply to
finite languages, but this one gives the flavor of this class of lemmas:

Pumping Lemma. Let L be an infinite regular language. Then
there are strings x, y, and z, such that y ≠ ε and x y^n z ∈ L for n ≥ 0.

The pumping lemma states that if a language is regular, then there is
some string y that can be ‘pumped’ appropriately. But this doesn’t mean that
if we can pump some string y, the language must be regular. Non-regular
languages may also have strings that can be pumped. Thus the lemma is not
used for showing that a language is regular. Rather it is used for showing
that a language isn’t regular, by showing that in some language there is no
possible string that can be pumped in the appropriate way.

Let’s use the pumping lemma to show that the language a^n b^n (i.e. the
language consisting of strings of a’s followed by an equal number of b’s) is
not regular. We must show that any possible string s that we pick cannot be
divided up into three parts x, y, and z such that y can be pumped. Given a
random string s from a^n b^n, we can distinguish three ways of breaking s up,
and show that no matter which way we pick, we cannot find some y that can
be pumped:

1. y is composed only of a’s. (This implies that x is all a’s too, and z
contains all the b’s, perhaps preceded by some a’s.) But if y is all a’s,
that means x y^n z (for any n > 1) has more a’s than xyz. But this means
it has more a’s than b’s, and so cannot be a member of the language
a^n b^n!

2. y is composed only of b’s. The problem here is similar to case 1; if y
is all b’s, that means x y^n z (for any n > 1) has more b’s than xyz, and
hence has more b’s than a’s.

3. y is composed of both a’s and b’s (this implies that x is only a’s, while
z is only b’s). This means that x y^n z (for any n > 1) must have some b’s
before a’s, and again cannot be a member of the language a^n b^n!

Thus there is no string in a^n b^n that can be divided into x, y, z in such a
way that y can be pumped, and hence a^n b^n is not a regular language.
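The three cases above can be checked by brute force for short strings. The Python sketch below tries every way of cutting a string into x, y, z with non-empty y and reports the cuts that survive pumping for a handful of values of n. The function names and the choice of tested powers are our own assumptions; testing finitely many powers only illustrates the argument and is not a substitute for the proof.

```python
import re


def in_anbn(s):
    """Membership test for the language a^n b^n."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and s == "a" * half + "b" * half


def in_ab_star(s):
    """Membership test for the regular language (ab)*, for contrast."""
    return re.fullmatch(r"(ab)*", s) is not None


def pumpable_splits(s, member, powers=(0, 2, 3)):
    """Return each split (x, y, z) of s, with y non-empty, such that
    x y^n z is still in the language for every tested n."""
    found = []
    for i in range(len(s) + 1):
        for j in range(i + 1, len(s) + 1):
            x, y, z = s[:i], s[i:j], s[j:]
            if all(member(x + y * n + z) for n in powers):
                found.append((x, y, z))
    return found
```

For strings such as aaabbb no split survives, whereas for abab in the regular language (ab)* the split x empty, y = ab, z = ab pumps without trouble.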

But while a^n b^n is not a regular language, it is a context-free language.
In fact, the context-free grammar that models a^n b^n only takes two rules!
Here they are:

S → a S b
S → ε


Here’s a sample parse tree using this grammar to derive the sentence
aabb, written in bracket notation:

[S a [S a [S ε] b] b]

There is also a pumping lemma for context-free languages, which can be
used to show that a language is not context-free; complete discussions can be
found in Hopcroft and Ullman (1979) and Partee (1990).
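The two-rule grammar for a^n b^n translates directly into a recursive recognizer: a string is derivable from S if it is empty, or if it starts with a, ends with b, and the material in between is itself derivable from S. The Python sketch below is one such reading of the grammar; the function names and the bracketed output format are our own choices.

```python
def derives_S(s):
    """True if S derives s using the rules S -> a S b and S -> epsilon."""
    if s == "":
        return True  # S -> epsilon
    # S -> a S b: peel one a from the left and one b from the right.
    return len(s) >= 2 and s[0] == "a" and s[-1] == "b" and derives_S(s[1:-1])


def parse_S(s):
    """Return a bracketed parse tree for s, or None if s is not derivable."""
    if s == "":
        return "[S ε]"
    if len(s) >= 2 and s[0] == "a" and s[-1] == "b":
        inner = parse_S(s[1:-1])
        if inner is not None:
            return "[S a " + inner + " b]"
    return None
```

The recursion needs a call stack whose depth grows with n, which is precisely the unbounded memory that a finite automaton lacks.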

Are English and other Natural Languages Regular Languages?

“How’s business?” I asked.

“Lousy and terrible.” Fritz grinned richly. “Or I pull off a new
deal in the next month or I go as a gigolo,”

“Either ... or ...,” I corrected, from force of professional habit.

“I’m speaking a lousy English just now,” drawled Fritz, with
great self-satisfaction. “Sally says maybe she’ll give me a few
lessons.”

Christopher Isherwood. 1935. “Sally Bowles” from
Goodbye to Berlin

The  pumping  lemma  provides  us  with  the  theoretical  machinery  for 
understanding  the  well-known  arguments  that  English  (or  rather  ‘the  set  of 
strings  of  English  words  considered  as  a formal  language’)  is  not  a regular 
language. 

The first such argument was given by Chomsky (1956) and Chomsky
(1957). He first considers the language {x x^R | x ∈ {a,b}*}. x^R means ‘the
reverse of x’, so each sentence of this language consists of a string of a’s
and b’s followed by the reverse or ‘mirror image’ of the string. This language
is not regular; Partee (1990) shows this by intersecting it with the regular
language aa*bbaa*. The resulting language is a^n b^2 a^n; it is left as an
exercise for the reader (Exercise 13.3) to show that this is not regular by
the pumping lemma.

Chomsky then showed that a particular subset of the grammar of
English is isomorphic to the mirror image language. He has us consider the
following English syntactic structures, where S1, S2, ..., Sn are declarative
sentences in English:

• If S1, then S2

• Either S3, or S4

• The man who said S5 is arriving today

Clearly, Chomsky points out, these are English sentences. Furthermore,
in each case there is a lexical dependency between one part of each
structure and another. “If” must be followed by “then” (and not, for example,
“or”). “Either” must be followed by “or” (and not, for example, “because”).

Now these sentences can be embedded in English, one in another; for
example, we could build sentences like the following:

If either the man who said S5 is arriving today or the man who
said S5 is arriving tomorrow, then the man who said S6 is arriving
the day after...

The regular languages are closed under substitution or homomorphism;
this just means that we can rename any of the symbols in the above sentences.
Let’s introduce the following substitution:

if → a
then → a
either → b
or → b
other words → ε


Now  if  we  apply  this  substitution  to  the  sentence  above,  we  get  the 
following  sentence: 

abba 
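The substitution is simple enough to run by hand, but a short Python sketch makes the mapping explicit. Here each word is looked up in the substitution table, every word outside the table is mapped to the empty string, and a helper checks for the mirror-image property. The function names, and the decision to lowercase the input and strip commas, are our own assumptions.

```python
# The substitution from the text; any word not listed maps to the empty string.
SUBSTITUTION = {"if": "a", "then": "a", "either": "b", "or": "b"}


def apply_substitution(sentence):
    """Map each word of the sentence through the substitution table."""
    words = sentence.lower().replace(",", " ").split()
    return "".join(SUBSTITUTION.get(word, "") for word in words)


def is_mirror(s):
    """True if s has the form x followed by the reverse of x."""
    return len(s) % 2 == 0 and s == s[::-1]


SENTENCE = ("If either the man who said S5 is arriving today or the man who "
            "said S5 is arriving tomorrow, then the man who said S6 is "
            "arriving the day after")
```

Applying `apply_substitution` to `SENTENCE` gives abba, which `is_mirror` accepts; deeper nestings of if/then and either/or give longer mirror-image strings in the same way.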

This sentence has just the mirror-like property that we showed above
was not capturable by finite-state methods. If we assume that if, then, either,
or, can be nested indefinitely, then English is isomorphic to
{x x^R | x ∈ {a,b}*}, and hence is not a regular language. Of course, it’s not
true that these structures can be nested indefinitely (sentences like this get
hard to understand after a couple nestings); we will return to this issue in
Section 13.4.

Partee (1990) gave a second proof that English is not a regular
language. This proof is based on a famous class of sentences with
center-embedded structures (Yngve, 1960); here is a variant of these sentences:

The  cat  likes  tuna  fish. 

The  cat  the  dog  chased  likes  tuna  fish. 

The  cat  the  dog  the  rat  bit  chased  likes  tuna  fish. 

The  cat  the  dog  the  rat  the  elephant  admired  bit  chased  likes  tuna  fish. 

As was true with the either/or sentences above, these sentences get
harder to understand as they get more complex. But for now, let’s assume
that the grammar of English allows an indefinite number of embeddings.
Then in order to show that English is not regular, we need to show that
sentences like these are isomorphic to some non-regular language. Since
every fronted NP must have its associated verb, these sentences are of the
form:

(the + noun)^n (transitive verb)^(n-1) likes tuna fish.

The idea of the proof will be to show that sentences of these
structures can be produced by intersecting English with a regular expression.
We will then use the pumping lemma to prove that the resulting language isn’t
regular.

In order to build a simple regular expression that we can intersect with
English to produce these sentences, we define regular expressions for the
noun groups (A) and the verbs (B):

A = { the  cat,  the  dog,  the  rat,  the  elephant,  the  kangaroo,. . . } 

B = { chased,  bit,  admired,  ate,  befriended,  . . . } 

Now if we take the regular expression /A* B* likes tuna fish/
and intersect it with English (considered as a set of strings), the resulting
language is:

L = x^n y^(n-1) likes tuna fish, x ∈ A, y ∈ B

This language L can be shown to be non-regular via the pumping
lemma (see Exercise 13.2). Since the intersection of English with a
regular language is not a regular language, English cannot be a regular
language either.
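To see what the intersection does, it helps to separate the regular filter from the counting constraint that English adds on top of it. In the Python sketch below, the filter is an ordinary regular expression built from small samples of the sets A and B, and membership in L additionally requires one fewer verb than noun groups. The word lists are truncated samples, and the function names and normalization step are our own assumptions.

```python
import re

# Small samples of the sets A (noun groups) and B (transitive verbs).
A = ["the cat", "the dog", "the rat", "the elephant", "the kangaroo"]
B = ["chased", "bit", "admired", "ate", "befriended"]

# The regular filter /A* B* likes tuna fish/.
FILTER = re.compile(
    "((?:{}) )*((?:{}) )*likes tuna fish".format("|".join(A), "|".join(B))
)


def normalize(sentence):
    return sentence.strip().rstrip(".").lower()


def passes_filter(sentence):
    """Membership in the regular language only; no counting involved."""
    return FILTER.fullmatch(normalize(sentence)) is not None


def in_L(sentence):
    """Membership in L: the filter plus the n versus n-1 constraint."""
    if not passes_filter(sentence):
        return False
    words = normalize(sentence).split()
    noun_groups = words.count("the")
    verbs = sum(word in B for word in words)
    return verbs == noun_groups - 1
```

A string like "the cat the dog likes tuna fish" passes the regular filter but is not in L; it is the counting step, which English itself imposes, that a finite automaton cannot perform.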

The two arguments we have seen so far are based on English syntax.
There are also arguments against the finite-state nature of English based on
English morphology. These morphological arguments are a different kind
of argument, because they don’t prove that English morphology couldn’t be
regular, only that a context-free model of English morphology is much more
elegant and captures some useful descriptive generalizations. Let’s
summarize one from Sproat (1993) on the prefix en-. Like other English verbs,
the verbs formed with this prefix can take the suffix -able. So for example the
verbs enjoy and enrich can be suffixed (enjoyable, enrichable). But the noun
or adjective stems themselves cannot take the -able (so *joyable, *richable).
In other words, -able can attach if the verb-forming prefix en- has already
attached, but not if it hasn’t.

The reason for this is very simple; en- creates verbs, and -able only
attaches to verbs. But expressing this fact in a regular grammar has an
annoying and inelegant redundancy; it would have to have two paths, one
through joy, one through enjoy, leading to different states, as follows:

This morphological fact is easy to express in a context-free grammar;
this is left as an exercise for the reader.

This kind of ‘elegance’ argument against regular grammars also has
been made for syntactic phenomena. For example a number of scholars have
argued that English number agreement cannot be captured by a regular (or
even a context-free) grammar. In fact, a simple regular grammar can model
number agreement, as Pullum and Gazdar (1982) show. They considered the
following sentences, which have a long-distance agreement dependency:

Which problem did your professor say she thought was unsolvable?

Which problems did your professor say she thought were unsolvable?

Here’s their regular (right-linear) grammar that models these sentences:

S → Which problem did your professor say T
S → Which problems did your professor say U
T → she thought T | you thought T | was unsolvable
U → she thought U | you thought U | were unsolvable


So a regular grammar could model English agreement. The problem
with such a grammar is not its computational power, but its elegance, as we
saw in Chapter 9; such a regular grammar would have a huge explosion in the
number of grammar rules. But for the purposes of computational complexity,
agreement is not part of an argument that English is not a regular language.
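Because the grammar is right-linear, it can be rewritten as a single regular expression: the recursive T and U rules become Kleene stars over “she thought” and “you thought”. The Python sketch below spells this out; the pattern name and helper function are ours, and the final question mark is included only because the example sentences carry one.

```python
import re

# One alternative per starting rule: the singular path (via T) ends in "was",
# the plural path (via U) ends in "were".
AGREEMENT = re.compile(
    r"Which problem did your professor say ((she|you) thought )*was unsolvable\?"
    r"|"
    r"Which problems did your professor say ((she|you) thought )*were unsolvable\?"
)


def grammatical(sentence):
    """True if the sentence is generated by the right-linear grammar."""
    return AGREEMENT.fullmatch(sentence) is not None
```

The duplication of the whole pattern for singular and plural is visible even in this tiny fragment, which is the inelegance the text refers to: every additional agreement feature would multiply the number of alternatives.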




13.3  Is  Natural  Language  Context-Free? 

The previous section argued that English (considered as a set of strings)
doesn’t seem like a regular language. The natural next question to ask is
whether English is a context-free language. This question was first asked by
Chomsky (1956), and has an interesting history; a number of well-known
attempts to prove English and other languages non-context-free have been
published, and all except two have been disproved after publication. One
of these two correct (or at least not-yet disproved) arguments derives from
the syntax of a dialect of Swiss German; the other from the morphology of
Bambara, a Northwestern Mande language spoken in Mali and neighboring
countries. The interested reader should see Pullum (1991, p. 131-146) for
an extremely witty history of both the incorrect and correct proofs; this
section will merely summarize one of the correct proofs, the one based on
Swiss German.

Both of the correct arguments, and most of the incorrect ones, make use
of the fact that the following languages, and ones that have similar
properties, are not context-free:

{xx | x ∈ {a,b}*}  (13.1)

This language consists of sentences containing two identical strings
concatenated. The following related language is also not context-free:

a^n b^m c^n d^m  (13.2)

The non-context-free nature of such languages can be shown using the
pumping lemma for context-free languages.
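Although no context-free grammar generates these two languages, membership in each is easy to decide with an ordinary program, which is a useful reminder that formal power and practical difficulty are different things. The Python sketch below gives one recognizer for each language; the function names are our own.

```python
import re


def is_copy(s):
    """Membership in {xx | x in {a,b}*}: the two halves must be identical."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and set(s) <= {"a", "b"} and s[:half] == s[half:]


def is_cross_serial(s):
    """Membership in a^n b^m c^n d^m: the a's pair with the c's and the
    b's pair with the d's, so the two dependencies cross."""
    match = re.fullmatch(r"(a*)(b*)(c*)(d*)", s)
    if match is None:
        return False
    a_run, b_run, c_run, d_run = match.groups()
    return len(a_run) == len(c_run) and len(b_run) == len(d_run)
```

Note the contrast with the mirror-image language: abba has nested dependencies and is context-free, while abab has crossing dependencies, and it is the crossing pattern that a context-free grammar cannot enforce.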

The attempts to prove that the natural languages are not a subset of
the context-free languages do this by showing that natural languages have a
property of these xx languages called cross-serial dependencies. In a
cross-serial dependency, words or larger structures are related in
left-to-right order as shown in Figure 13.6. A language that has arbitrarily
long cross-serial dependencies can be mapped to the xx languages.

The successful proof, independently proposed by Huybregts (1984)
and Shieber (1985a), shows that a dialect of Swiss German spoken in Zurich
has cross-serial constraints which make certain parts of that language
equivalent to the non-context-free language a^n b^m c^n d^m. The intuition is
that Swiss German allows a sentence to have a string of dative nouns followed
by a string of accusative nouns, followed by a string of dative-taking verbs,
followed by a string of accusative-taking verbs.

Figure 13.6  A schematic of a cross-serial dependency.

We will follow the version of the proof presented in Shieber (1985a).
First, he notes that Swiss German allows verbs and their arguments to be
ordered cross-serially. Assume that all the example clauses we present below
are preceded by the string “Jan säit das” (“Jan says that”):

(13.3) ...mer em Hans es huus hälfed aastriiche.
       ...we Hans/DAT the house/ACC helped paint.
       ‘...we helped Hans paint the house.’

Notice the cross-serial nature of the semantic dependency: both nouns
precede both verbs, and em Hans (Hans) is the argument of hälfed (helped)
while es huus (the house) is the argument of aastriiche (paint). Furthermore,
there is a cross-serial case dependency between the nouns and verbs; hälfed
(helped) requires the dative, and em Hans is dative, while aastriiche (paint)
takes the accusative, and es huus (the house) is accusative.

Shieber points out that this case marking can occur even across triply
embedded cross-serial clauses like the following:

(13.4) ...mer d’chind em Hans es huus haend wele laa hälfe aastriiche.
       ...we the children/ACC Hans/DAT the house/ACC have wanted to let
       help paint.
       ‘...we have wanted to let the children help Hans paint the house.’

Shieber notes that among such sentences, those with all dative NPs
preceding all accusative NPs, and all dative-subcategorizing V’s preceding
all accusative-subcategorizing V’s are acceptable.

Jan säit das mer (d'chind)* (em Hans)* es huus haend wele laa*
hälfe* aastriiche.

Let’s call the regular expression above R. Since it’s a regular
expression (you see it only has concatenation and Kleene stars) it must define
a regular language, and so we can intersect R with Swiss German. Because the
context-free languages are closed under intersection with regular languages,
if Swiss German is context-free, so is the result.

But it turns out that Swiss German requires that the number of verbs
requiring dative objects (hälfe) must equal the number of dative NPs (em
Hans) and similarly for accusatives. Furthermore, an arbitrary number of
verbs can occur in a subordinate clause of this type (subject to performance
constraints). This means that the result of intersecting this regular language
with Swiss German is the following language:

L = Jan säit das mer (d'chind)^n (em Hans)^m es huus haend wele
(laa)^n (hälfe)^m aastriiche.

But this language is of the form w a^n b^m x c^n d^m y, which is not
context-free!

So we can conclude that Swiss German is not context free.
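The structure of this argument (a regular filter R, plus a counting constraint contributed by the language itself) can be sketched in Python. The regular expression below plays the role of R, and the membership test for L additionally requires the accusative NPs to match the accusative-taking verbs and the dative NPs to match the dative-taking verbs. The spellings in the strings, the single-space tokenization, and the function names are assumptions of this sketch; it says nothing about which strings Swiss German speakers actually accept.

```python
import re

# The regular filter R. Each starred unit is wrapped in a capturing group so
# that the number of repetitions can be counted afterwards.
R = re.compile(
    r"Jan säit das mer ((?:d'chind )*)((?:em Hans )*)es huus haend wele "
    r"((?:laa )*)((?:hälfe )*)aastriiche"
)


def dependency_counts(clause):
    """Return (n_acc_np, n_dat_np, n_acc_verb, n_dat_verb), or None if the
    clause does not match the regular filter R."""
    match = R.fullmatch(clause)
    if match is None:
        return None
    acc_nps, dat_nps, acc_verbs, dat_verbs = match.groups()
    return (acc_nps.count("d'chind"), dat_nps.count("em Hans"),
            acc_verbs.count("laa"), dat_verbs.count("hälfe"))


def in_L(clause):
    """Membership in L: matches R and the cross-serial counts agree."""
    counts = dependency_counts(clause)
    if counts is None:
        return False
    n_acc_np, n_dat_np, n_acc_verb, n_dat_verb = counts
    return n_acc_np == n_acc_verb and n_dat_np == n_dat_verb
```

The first and third counts correspond to the exponent n and the second and fourth to m, so the strings accepted by `in_L` have exactly the shape w a^n b^m x c^n d^m y.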

13.4  Complexity  and  Human  Processing 

We noted in passing earlier that many of the sentences that were used to
argue for the non-finite state nature of English (like the ‘center-embedded’
sentences) are quite difficult to understand. If you are a speaker of Swiss
German (or if you have a friend who is), you will notice that the long
cross-serial sentences in Swiss German are also rather difficult to follow.
Indeed, as Pullum and Gazdar (1982) point out,

“...precisely those construction-types that figure in the various
proofs that English is not context-free appear to cause massive
difficulty in the human processing system...”

This  brings  us  to  a second  use  of  the  term  complexity.  In  the  previous 
section  we  talked  about  the  complexity  of  a language.  Here  we  turn  to  a 
question  that  is  as  much  psychological  as  computational:  the  complexity  of 
an  individual  sentence.  Why  are  certain  sentences  hard  to  comprehend?  Can 
this  tell  us  anything  about  computational  processes? 

Many things can make a sentence hard to understand; complicated
meanings, extremely ambiguous sentences, the use of rare words, and bad
handwriting are just a few. Chapter 12 introduced garden-path sentences,
which are certainly complex, and showed that their complexity was due to
improper choices made on temporarily ambiguous sentences by the human
parser. But there is another, particular, kind of complexity (often called
‘linguistic complexity’ or ‘syntactic complexity’) that bears an interesting
relation to the formal-language complexity from the previous section. These
are sentences whose complexity arises not from rare words or difficult
meanings, but from a particular combination of syntactic structure and human
memory limitations. Here are some examples of sentences (taken from a
summary in Gibson (1998)) that cause difficulties when people try to read
them (we will use the # to mean that a sentence causes extreme processing
difficulty). In each case the (ii) example is significantly more complex than
the (i) example:

(13.5)  (i)  The cat likes tuna fish.
        (ii) #The cat the dog the rat the elephant admired bit chased likes
             tuna fish.

(13.6)  (i)  If when the baby is crying, the mother gets upset, the father
             will help, so the grandmother can rest easily.
        (ii) #Because if when the baby is crying, the mother gets upset, the
             father will help, the grandmother can rest easily.

(13.7)  (i)  The child damaged the pictures which were taken by the
             photographer who the professor met at the party.
        (ii) #The pictures which the photographer who the professor met at
             the party took were damaged by the child.

(13.8)  (i)  The fact that the employee who the manager hired stole office
             supplies worried the executive.
        (ii) #The executive who the fact that the employee stole office
             supplies worried hired the manager.

The earliest work on sentences of this type noticed that they all exhibit
nesting or center-embedding (Chomsky, 1957; Yngve, 1960; Chomsky and
Miller, 1963; Miller and Chomsky, 1963). That is, they all contain examples
where a syntactic category A is nested within another category B, and
surrounded by other words (X and Y):

$[_B\ X\ [_A\ ]\ Y]$

In each of the examples above, part (i) has zero or one embedding,
while part (ii) has two or more embeddings. For example, in (13.5ii) above,
there are three reduced relative clauses embedded inside each other:

# [S The cat [S' the dog [S' the rat [S' the elephant admired] bit]
chased] likes tuna fish].

In (13.6ii) above, the when clause is nested inside the if clause, which is
in turn nested inside the because clause:

#[Because [if [when the baby is crying, the mother gets upset],
the father will help], [the grandmother can rest easily]].

In  (13.7ii),  the  relative  clause  who  the  professor  met  at  the  party  is 
nested  in  between  the  photographer  and  took.  The  relative  clause  which  the 
photographer. . . took  is  then  nested  between  The  pictures  and  were  damaged 
by  the  child. 

#The pictures [which the photographer [who the professor met
at the party] took] were damaged by the child.
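
The bracketings above suggest a simple way to quantify nesting. The
following is a small Python sketch, assuming sentences are written in the
bracket notation used in these examples; the function name
`max_nesting_depth` is hypothetical. It reports how many brackets are open
at the deepest point of the sentence.

```python
def max_nesting_depth(bracketed: str) -> int:
    """Return the largest number of simultaneously open brackets."""
    depth = 0
    deepest = 0
    for ch in bracketed:
        if ch == "[":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("closing bracket without a matching opener")
    if depth != 0:
        raise ValueError("unclosed bracket")
    return deepest

easy = "[S The cat likes tuna fish]"
hard = ("[S The cat [S' the dog [S' the rat [S' the elephant admired] bit] "
        "chased] likes tuna fish]")
print(max_nesting_depth(easy), max_nesting_depth(hard))
```

Depth alone is only a first approximation of difficulty, as the discussion
later in this section makes clear.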

Could we explain the difficulty of these nested structures just by
saying that they are ungrammatical in English? The answer seems to be no.
The structures that are used in the complex sentences in (13.5ii)-(13.8ii) are
the same ones used in the easier sentences (13.5i)-(13.8i). The difference
between the easy and complex sentences seems to hinge on the number of
embeddings. But there is no natural way to write a grammar that allows N
embeddings but not N + 1 embeddings.

Rather, the complexity of these sentences seems to be a processing
phenomenon: something about the human parsing mechanism makes it unable to
deal with these kinds of multiple nestings. If complexity is a fact about
‘parsers’ rather than grammars, we would expect sentences to be complex
for similar reasons in other languages. That is, other languages have different
grammars, but presumably some of the architecture of the human parser is
shared from language to language.

It does seem to be the case that multiply nested structures of this
kind are also difficult in other languages. For example, Japanese allows a
singly nested clause, but an additional nesting makes a sentence
unprocessable (Cowper, 1976; Babyonyshev and Gibson, 1999).

(13.9)  Ani-ga imooto-o ijimeta.
        older-brother-NOM younger-sister-ACC bullied
        ‘My older brother bullied my younger sister’

(13.10)  Bebiisitaa-wa [[ani-ga imooto-o ijimeta] to] itta.
         babysitter-TOP [[older-brother-NOM younger-sister-ACC bullied] that] said
         ‘The babysitter said that my older brother bullied my younger sister’

(13.11)  #Obasan-wa [[Bebiisitaa-ga [[ani-ga imooto-o ijimeta] to] itta] to] omotteiru.
         aunt-TOP [[babysitter-NOM [[older-brother-NOM younger-sister-ACC bullied] that] said] that] thinks
         ‘#My aunt thinks that the babysitter said that my older brother bullied
         my younger sister’

There are a number of attempts to explain these complexity effects,
many of which are memory-based. That is, they rely on the intuition that
each embedding requires some memory resource to store. A sentence with
too much embedding either uses up too many memory resources, or creates
multiple memory traces that are confusable with each other. The result is
that the sentence is too hard to process at all.

For example, Yngve (1960) proposed that the human parser is based on
a limited-size stack. A stack-based parser places incomplete phrase-structure
rules on the stack; if multiple incomplete phrases are nested, the stack will
contain an entry for each of these incomplete rules. Yngve suggested that
the more incomplete phrase-structure rules the parser needs to store on the
stack, the more complex the sentence. Yngve’s intuition was that these stack
limits might mean that English is actually a regular rather than context-free
language, since a context-free grammar with a finite limit on its stack size
can be modeled by a finite automaton.
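
The last point can be illustrated with a small sketch. Assume, purely for
illustration, that nesting is encoded with bracket characters and that the
stack limit is a parameter `limit`. A recognizer whose stack never holds
more than `limit` entries then needs only `limit + 1` states, one per
possible depth, which makes it a finite automaton. The function names are
hypothetical.

```python
def build_bounded_depth_dfa(limit: int) -> dict:
    """Transition table for bracket strings nested at most `limit` deep.

    States are the integers 0..limit (the current depth). A missing
    transition means the automaton rejects.
    """
    table = {}
    for depth in range(limit + 1):
        if depth < limit:
            table[(depth, "[")] = depth + 1   # room left: go one level deeper
        if depth > 0:
            table[(depth, "]")] = depth - 1   # close the innermost bracket
    return table

def accepts(table: dict, symbols: str) -> bool:
    """Run the automaton; accept if it ends back at depth 0."""
    state = 0
    for symbol in symbols:
        if (state, symbol) not in table:
            return False
        state = table[(state, symbol)]
    return state == 0

dfa = build_bounded_depth_dfa(2)
print(accepts(dfa, "[[]]"), accepts(dfa, "[[[]]]"))
```

With `limit` set to 2, doubly nested input is accepted and triply nested
input is rejected, which mirrors the idea that sentences beyond some depth
simply fall outside what the bounded parser can handle.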

An extension to this model (Miller and Chomsky, 1963) proposes that
self-embedded structures are particularly difficult. A self-embedded
structure contains a syntactic category A nested within another instance of
A, and surrounded by other words (X and Y):

$[_A\ X\ [_A\ ]\ Y]$

Such structures might be difficult because a stack-based parser might
confuse two copies of the rule on the stack. This problem with self-embedding
is also naturally modeled with an activation-based model, which might have
only one copy of a particular rule.
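
As a rough illustration of the definition, the following Python sketch
scans a labeled bracketing and reports every category that occurs nested
inside another instance of itself. It assumes each opening bracket is
immediately followed by its category label, as in `[S ...]`; the function
name is hypothetical.

```python
import re

# Matches either an opening bracket with its label, or a closing bracket.
TOKEN = re.compile(r"\[([^\s\[\]]+)|\]")

def self_embedded_labels(bracketed: str) -> set:
    """Return the labels that appear nested within the same label."""
    open_labels = []      # labels of the constituents currently open
    found = set()
    for match in TOKEN.finditer(bracketed):
        label = match.group(1)
        if label is not None:
            if label in open_labels:
                found.add(label)
            open_labels.append(label)
        else:
            open_labels.pop()
    return found
```

The list `open_labels` plays the role of the parser's stack here: a label
that is pushed while an identical label is already on the stack is exactly
the situation in which two copies of a rule might be confused.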

Although these classic parser-based explanations have intuitive appeal,
and tie in nicely to the formal language complexity issues, it seems
unlikely that they are correct. One problem with them is that there are lots
of syntactic complexity effects that aren’t explained by these models. For
example, there are significant complexity differences between sentences that
have the same number of embeddings, such as the well-known difference
between subject-extracted relative clauses (13.12ii) and object-extracted
relative clauses (13.12i):

(13.12)  (i)  [S The reporter [S' who [S the senator attacked]] admitted the
              error].
         (ii) [S The reporter [S' who [S attacked the senator]] admitted the
              error].
The object-extracted relative clauses are more difficult to process
(measured, for example, by the amount of time it takes to read them (Ford,
1983), and by other factors; see for example Wanner and Maratsos (1978) and
King and Just (1991), and Gibson (1998) for a survey). Different researchers
have hypothesized a number of different factors that might explain this
complexity difference.

For example, MacWhinney and colleagues (MacWhinney, 1977, 1982;
MacWhinney and Pléh, 1988) suggest that it causes difficulty for the
reader to shift perspective from one clause participant to another. Object
relatives require two perspective shifts (from the matrix subject to the
relative clause subject and then back) while subject relatives require none
(the matrix subject is the same as the relative clause subject). Another
potential source of the difficulty in the object extraction is that the first
noun (the reporter) plays two different thematic roles: agent of one clause,
patient of the other. This conflicting role-assignment may cause difficulties
(Bever, 1970).

Gibson (1998) points out that there is another important difference
between the object and subject extractions: the object extraction has two
nouns that appear before any verb. The reader must hold on to these two nouns
without knowing how they will fit into the sentence. Having multiple noun
phrases lying around that aren’t integrated into the meaning of the sentence
presumably causes complexity for the reader.

Based on this observation, Gibson proposes the Syntactic Prediction
Locality Theory (SPLT), which predicts that the syntactic memory load
associated with a structure is the sum of the memory loads associated with
each of the words that are obligatorily required to complete the sentence. A
sentence with multiple noun phrases and no verbs will require multiple verbs
before the sentence is complete, and will thus have a high load. Memory load
is also based on how many other new phrases or discourse referents have to
be held in memory at the same time. Thus the memory load for a word is
higher if there have been many intervening new discourse referents since the
word was predicted. So while a sequence of unintegrated NPs is very
complex, a sequence in which one of the two NPs is a pronoun referring to
someone already in the discourse is less complex. For example, the following
examples of doubly nested relative clauses are processable because the
innermost NP (I) does not introduce a new discourse entity.

(13.13)  (a)  A syntax book [that some Italian [that I had never heard of]
              wrote] was published by MIT Press. (Frank, 1992)

         (b)  The pictures [that the photographer [who I met at the party]
              took] turned out very well. (Bever, personal communication to
              E. Gibson)
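
The intuition behind these contrasts can be mimicked with a deliberately
crude proxy. The sketch below is not Gibson's cost function; it is a toy
that assumes each verb simply integrates the most recent waiting noun
phrase, and it reports the peak number of new-referent NPs that are waiting
for a verb at any point. The tag names `NP`, `PRO`, and `V` and the function
name are assumptions made for this illustration.

```python
def peak_unintegrated_nps(tags) -> int:
    """Peak count of new-referent NPs not yet picked up by a verb.

    tags: sequence of "NP" (introduces a new discourse referent),
          "PRO" (pronoun for a referent already in the discourse),
          or "V" (verb).
    """
    pending = []   # noun phrases seen so far that no verb has integrated
    peak = 0
    for tag in tags:
        if tag in ("NP", "PRO"):
            pending.append(tag)
        elif tag == "V" and pending:
            pending.pop()
        # Only NPs that introduce new referents count toward the load.
        peak = max(peak, pending.count("NP"))
    return peak

object_relative = ["NP", "NP", "V", "V", "NP"]    # cf. (13.12i)
subject_relative = ["NP", "V", "NP", "V", "NP"]   # cf. (13.12ii)
print(peak_unintegrated_nps(object_relative),
      peak_unintegrated_nps(subject_relative))
```

Under this toy measure the object relative peaks higher than the subject
relative, and replacing the innermost NP of a doubly nested clause with a
pronoun lowers the peak, in the same direction as the judgments reported
above.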

In summary, the early suggestions that the complexity of human
sentence processing is related to memory seem to be correct at some level;
complexity in both natural and formal languages is caused by the need to keep
many un-integrated things in memory. This is a deep and fascinating
finding about language processing. But the relation between formal and natural
complexity is not as simple as Yngve and others thought. Exactly which
factors do play a role in complexity is an exciting research area that is just
beginning to be investigated.


13.5  Summary 

This  chapter  introduced  two  different  ideas  of  complexity:  the  complexity 
of  a formal  language,  and  the  complexity  of  a human  sentence. 

• Grammars can be characterized by their generative power. One
grammar is of greater generative power or complexity than another if it can
define a language that the other cannot define. The Chomsky
hierarchy is a hierarchy of grammars based on their generative power. It
includes Turing-equivalent, context-sensitive, context-free, and
regular grammars.

• The pumping lemma can be used to prove that a given language is not
regular. English is not a regular language, although the kinds of
sentences that make English non-regular are exactly those that are hard for
people to parse. Despite many decades of attempts to prove the
contrary, English does, however, seem to be a context-free language. The
syntax of Swiss-German and the morphology of Bambara, by contrast,
are not context-free, and seem to require context-sensitive grammars.

• Center-embedded sentences are hard for people to parse. Many
theories agree that this difficulty is somehow caused by memory
limitations of the human parser.

Bibliographical and Historical Notes

Chomsky (1956) first asked whether finite-state automata or context-free
grammars were sufficient to capture the syntax of English. His suggestion
in that paper that English syntax contained “examples that are not easily
explained in terms of phrase structure” was a motivation for his development
of syntactic transformations. Pullum (1991, pp. 131-146) is the definitive
historical study of research on the non-context-free-ness of natural language.
The early history of attempts to prove natural languages non-context-free is
summarized in Pullum and Gazdar (1982). The pumping lemma was originally
presented by Bar-Hillel et al. (1961), who also offer a number of
important proofs about the closure and decidability properties of finite-state
and context-free languages. Further details, including the pumping lemma for
context-free languages (also due to Bar-Hillel et al. (1961)), can be found in
a textbook on automata theory such as Hopcroft and Ullman (1979).

Yngve’s idea that the difficulty of center-embedded sentences could be
explained if the human parser was finite-state was taken up by Church (1980)
in his master’s thesis. He showed that a finite-state parser that implements
this idea could also explain a number of other grammatical and
psycholinguistic phenomena. While the field has turned toward more
sophisticated models of complexity, Church’s work can be seen as the beginning
of the return to finite-state models that characterized the 1980s and 1990s.

There are a number of other ways of looking at complexity that we
didn’t have space to go into here. One is whether language processing is
NP-complete. NP-complete is the name of a class of problems which are
suspected to be particularly difficult to process. Barton et al. (1987) prove a
number of complexity results about the NP-completeness of natural language
recognition and parsing. Among other things, they showed that

1.  Maintaining lexical and agreement feature ambiguities over a
potentially infinite-length sentence causes the problem of recognizing
sentences in some unification-based formalisms like Lexical-Functional
Grammar to be NP-complete.

2.  Two-level morphological parsing (or even just mapping between
lexical and surface form) is also NP-complete.

Recent work has also begun to link processing complexity with
information-theoretic measures like Kolmogorov complexity (Juola, 1999).

Exercises

13.1  Is the language $a^n b^2 a^n$ context-free?

13.2  Use the pumping lemma to show this language is not regular:

L = $x^n y^{n-1}$ likes tuna fish, $x \in A$, $y \in B$

13.3  Partee (1990) showed that the language $xx^R, x \in \{a,b\}^*$ is not
regular, by intersecting it with the regular language $aa^*bbaa^*$. The
resulting language is $a^n b^2 a^n$. Use the pumping lemma to show that this
language is not regular, completing the proof that $xx^R, x \in \{a,b\}^*$ is
not regular.

13.4  Build a context-free grammar for the language

$L = \{xx^R \mid x \in \{a,b\}^*\}$

13.5  Use a context-free grammar to represent the English morphological
facts described in Figure 13.5. Assume that en- applies to a particular class
of  adjectives  (call  it  Adj^)  and  nouns  (call  it  Noun\^). 


Part  III 

SEMANTICS 


Semantics  is  the  study  of  the  meaning  of  linguistic  utterances.  For 
our  purposes,  this  amounts  to  the  study  of  formal  representations  that 
are  capable  of  capturing  the  meanings  of  linguistic  utterances,  and 
the  study  of  algorithms  that  are  capable  of  mapping  from  linguistic 
utterances  to  appropriate  meaning  representations.  As  we  will  see,  the 
most  important  topic  to  be  addressed  in  this  study  is  how  the  meaning 
of  an  utterance  is  related  to  the  meanings  of  the  phrases,  words,  and 
morphemes  that  make  it  up.  Following  tradition,  issues  related  to 
speakers  and  hearers,  and  the  context  in  which  utterances  are  found, 
will  be  deferred  to  Part  IV,  which  takes  up  the  topic  of  Pragmatics. 

This part of the book begins by exploring ways to represent the
meaning of utterances, focusing on the use of First Order Predicate
Calculus. It next explores various theoretical and practical approaches
to compositional semantic analysis, as well as its use in practical
problems such as question answering and information extraction. It next
turns to the topic of the meanings of individual words, the role of
meaning in the organization of a lexicon, and algorithms for
word-sense disambiguation. Finally, it covers the topic of information
retrieval, an application area of great importance that operates almost
entirely on the basis of individual word meanings.

14

REPRESENTING MEANING


ISHMAEL:  Surely  all  this  is  not  without  meaning. 
Herman  Melville,  Moby  Dick 


The approach to semantics that is introduced here, and is elaborated
on in the next four chapters, is based on the notion that the meaning of
linguistic utterances can be captured in formal structures, which we will call
meaning representations. Correspondingly, the frameworks that are used
to specify the syntax and semantics of these representations will be called
meaning representation languages. These meaning representations play
a role analogous to that of the phonological, morphological, and syntactic
representations introduced in earlier chapters.

The need for these representations arises when neither the raw
linguistic inputs, nor any of the structures derivable from them by any of the
transducers we have studied, facilitate the kind of semantic processing that is
desired. More specifically, what is needed are representations that can bridge
the gap from linguistic inputs to the kind of non-linguistic knowledge needed
to perform a variety of tasks involving the meaning of linguistic inputs.

To  illustrate  this  idea,  consider  the  following  everyday  language  tasks 
that  require  some  form  of  semantic  processing. 


• Answering  an  essay  question  on  an  exam. 

• Deciding  what  to  order  at  a restaurant  by  reading  a menu. 

• Learning  to  use  a new  piece  of  software  by  reading  the  manual. 

• Realizing  that  you’ve  been  insulted. 

• Following  a recipe. 


Chapter  14.  Representing  Meaning 


It should be clear that simply having access to the kind of phonological,
morphological, and syntactic representations we have discussed thus far will
not get us very far on accomplishing any of these tasks. These tasks require
access to representations that link the linguistic elements involved in the
task to the non-linguistic knowledge of the world needed to successfully
accomplish them. For example, some of the knowledge of the world needed to
perform the above tasks includes:

• Answering and grading essay questions requires background
knowledge about the topic of the question, the desired knowledge level of
the students, and how such questions are normally answered.

• Reading a menu and deciding what to order, giving advice about where
to go to dinner, following a recipe, and generating new recipes all
require deep knowledge about food, its preparation, what people like to
eat and what restaurants are like.

• Learning to use a piece of software by reading a manual, or giving
advice about how to do the same, requires deep knowledge about current
computers, the specific software in question, similar software
applications, and knowledge about users in general.

In the representational approach being explored here, we take
linguistic inputs and construct meaning representations that are made up of
the same kind of stuff that is used to represent this kind of everyday
commonsense knowledge of the world. The process whereby such representations
are created and assigned to linguistic inputs is called semantic analysis.

To make this notion more concrete, consider Figure 14.1, which shows
sample meaning representations for the sentence I have a car using four
frequently used meaning representation languages. The first row illustrates a
sentence in First Order Predicate Calculus, which will be covered in detail in
Section 14.3; the graph in the center illustrates a Semantic Network, which
will be discussed further in Section 14.5; the third row contains a Conceptual
Dependency diagram, discussed in more detail in Chapter 16; and the final row
shows a frame-based representation, also covered in Section 14.5.

While there are a number of significant differences among these four
approaches to representation, at an abstract level they all share as a common
foundation the notion that a meaning representation consists of structures
composed from a set of symbols. When appropriately arranged, these
symbol structures are taken to correspond to objects, and relations among
objects, in some world being represented. In this case, all four
representations make use of symbols corresponding to the speaker, a car, and a
number of relations denoting the possession of one by the other.

$\exists x, y\ Having(x) \land Haver(Speaker, x) \land HadThing(y, x) \land Car(y)$

[Semantic network: a Having node with a Haver arc to Speaker and a
Had-Thing arc to Car]

[Conceptual Dependency diagram: Car, POSS-BY, Speaker]

Having
  Haver: Speaker
  HadThing: Car

Figure 14.1  A list of symbols, two directed graphs, and a record structure:
a sampler of meaning representations for I have a car.
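
The shared foundation can be seen by writing two of these representations
down as plain data. The following Python sketch is only an illustration:
the constants `h1` and `c1` stand in for the existentially quantified
variables, and the dictionary layout and the function `frame_to_atoms` are
assumptions, not a standard encoding.

```python
# FOPC-style reading: a set of atomic facts, with the constants "h1" and "c1"
# standing in for the existentially quantified variables x and y.
fopc_atoms = {
    ("Having", "h1"),
    ("Haver", "Speaker", "h1"),
    ("HadThing", "c1", "h1"),
    ("Car", "c1"),
}

# Frame-style reading: a record with named slots.
frame = {"type": "Having", "Haver": "Speaker", "HadThing": "Car"}

def frame_to_atoms(frame: dict, event_id: str, thing_id: str) -> set:
    """Unpack a Having frame into the same atoms as the FOPC reading."""
    return {
        (frame["type"], event_id),
        ("Haver", frame["Haver"], event_id),
        ("HadThing", thing_id, event_id),
        (frame["HadThing"], thing_id),
    }

print(frame_to_atoms(frame, "h1", "c1") == fopc_atoms)
```

Both structures are built from the same small inventory of symbols
(Having, Haver, HadThing, Speaker, Car); they differ only in how those
symbols are arranged.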

It is important to note that these representations can be viewed from at
least two distinct perspectives in all four of these approaches: as
representations of the meaning of the particular linguistic input I have a car,
and as representations of the state of affairs in some world. It is this dual
perspective that allows these representations to be used to link linguistic
inputs to the world and to our knowledge of it.

The structure of this part of the book parallels that of the previous parts.
We will alternate discussions of the nature of meaning representations with
discussions of the computational processes that can produce them. More
specifically, this chapter introduces the basics of what is needed in a
meaning representation, while Chapter 15 introduces a number of techniques for
assigning meanings to linguistic inputs. Chapter 16 explores a range of
complex representational issues related to the meanings of words. Chapter 17
then explores some robust computational methods designed to exploit these
lexical representations.

Note that since the emphasis of this chapter is on the basic
requirements of meaning representations, we will defer a number of extremely
important issues to later chapters. In particular, the focus of this chapter is
on representing what is sometimes called the literal meaning of sentences. By
this, we have in mind representations that are closely tied to the conventional
meanings of the words that are used to create them, and that do not reflect
the context in which they occur. The shortcomings of such representations
with respect to phenomena such as idioms and metaphor will be discussed
in the next two chapters, while the role of context in ascertaining the deeper
meaning of sentences will be covered in Chapters 18 and 19.

There are three major parts to this chapter. Section 14.1 explores some
of the practical computational requirements for what is needed in a meaning
representation language. Section 14.2 then discusses some of the ways that
language is structured to convey meaning. Section 14.3 then provides an
introduction to First Order Predicate Calculus, which has historically been
the principal technique used to investigate semantic issues.


14.1  Computational Desiderata for Representations

We begin by considering the issue of why meaning representations are needed
and what they should do for us. To focus this discussion, we will consider in
more detail the task of giving advice about restaurants to tourists. In this
discussion, we will assume that we have a computer system that accepts spoken
language queries from tourists and constructs appropriate responses by using
a knowledge base of relevant domain knowledge. A series of examples will
serve to introduce some of the basic requirements that a meaning
representation must fulfill, and some of the complications that inevitably
arise in the process of designing such meaning representations. In each of
these examples, we will examine the role that the representation of the
meaning of the request must play in the process of satisfying it.

Verifiability 

Let  us  begin  by  considering  the  following  simple  question. 

(14.1)  Does  Maharani  serve  vegetarian  food? 

This example illustrates the most basic requirement for a meaning
representation: it must be possible to use the representation to determine the
relationship between the meaning of a sentence and the world as we know it. In
other words, we need to be able to determine the truth of our representations.
The most straightforward way to implement this notion is to make it possible
for a system to compare, or match, the representation of the meaning of an
input against the representations in its knowledge base, its store of
information about its world.

In this example, let us assume that the meaning of this question
contains, as a component, the meaning underlying the proposition Maharani
serves vegetarian food. For now, we will simply gloss this representation as:

Serves(Maharani, VegetarianFood)

It is this representation of the input that will be matched against the
knowledge base of facts about a set of restaurants. If the system finds a
representation matching the input proposition in its knowledge base, it can
return an affirmative answer. Otherwise, it must either say No, if its
knowledge of local restaurants is complete, or say that it does not know, if
there is reason to believe that its knowledge is incomplete.

This notion is known as verifiability, and concerns a system’s ability
to compare the state of affairs described by a representation to the state of
affairs in some world as modeled in a knowledge base.¹
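
The matching procedure just described can be sketched in a few lines of
Python. This is a minimal illustration rather than a real system: the
tuple encoding of propositions, the contents of the knowledge base beyond
the Maharani fact, and the names `answer_yes_no` and `kb_is_complete` are
all assumptions made for the example.

```python
# Propositions are encoded as tuples: (predicate, argument, argument).
knowledge_base = {
    ("Serves", "Maharani", "VegetarianFood"),   # the fact glossed in the text
    ("Serves", "HypotheticalDiner", "Burgers"), # made-up filler fact
}

def answer_yes_no(query, knowledge_base, kb_is_complete):
    """Match a query proposition against the knowledge base."""
    if query in knowledge_base:
        return "Yes"
    # No match: a complete knowledge base licenses "No"; otherwise the
    # system can only admit that it does not know.
    return "No" if kb_is_complete else "I don't know"

query = ("Serves", "Maharani", "VegetarianFood")
print(answer_yes_no(query, knowledge_base, kb_is_complete=False))
```

Exact set membership is the simplest possible notion of matching; it only
works if every input that means the same thing is mapped to the same
tuple.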

Unambiguous  Representations 

The domain of semantics, like all the other domains we have studied, is
subject to ambiguity. Specifically, single linguistic inputs can legitimately
have different meaning representations assigned to them based on the
circumstances in which they occur.

Consider  the  following  example  from  the  BERP  corpus. 

(14.2)  I wanna  eat  someplace  that’s  close  to  ICSI. 

Given the allowable argument structures for the verb eat, this sentence can
either mean that the speaker wants to eat at some nearby location, or, under
a Godzilla-as-speaker interpretation, that the speaker wants to devour some
nearby location. The answer generated by the system for this request will
depend on which interpretation is chosen as the correct one.

Since ambiguities such as this abound in all genres of all languages,
some means of determining that certain interpretations are preferable (or
alternatively, less preferable) to others is needed. The various linguistic
phenomena that give rise to such ambiguities, and the techniques that can
be employed to deal with them, will be discussed in detail in the next four
chapters.

1 This  is  a fairly  practical  characterization  of  verifiability.  More  theoretical  views  of  this 
notion  are  briefly  covered  in  Section  14.6. 
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Our  concern  in  this  chapter,  however,  is  with  the  status  of  our  meaning 
representations  with  respect  to  ambiguity,  and  not  with  how  we  arrive  at 
correct  interpretations.  Since  we  reason  about,  and  act  upon,  the  semantic 
content  of  linguistic  inputs,  the  final  representation  of  an  input’s  meaning 
should  be  free  from  any  ambiguity.  Therefore,  regardless  of  any  ambiguity 
in  the  raw  input,  it  is  critical  that  a meaning  representation  language  support 
representations  that  have  a single  unambiguous  interpretation.  2 
A concept closely related to ambiguity is vagueness. Like ambiguity,
vagueness can make it difficult to determine what to do with a particular
input based on its meaning representation. Vagueness, however, does not
give rise to multiple representations.

Consider  the  following  request  as  an  example. 

(14.3)  I want  to  eat  Italian  food. 

While  the  use  of  the  phrase  Italian  food  may  provide  enough  information  for 
a restaurant  advisor  to  provide  reasonable  recommendations,  it  is  neverthe- 
less quite  vague  as  to  what  the  user  really  wants  to  eat.  Therefore,  a vague 
representation  of  the  meaning  of  this  phrase  may  be  appropriate  for  some 
purposes,  while  a more  specific  representation  may  be  needed  for  other  pur- 
poses. It  will,  therefore,  be  advantageous  for  a meaning  representation  lan- 
guage to  support  representations  that  maintain  a certain  level  of  vagueness. 
Note  that  it  is  not  always  easy  to  distinguish  ambiguity  from  vagueness. 
Zwicky  and  Sadock  (1975)  provide  a useful  set  of  tests  that  can  be  used  as 
diagnostics. 

Canonical  Form 

The  notion  that  single  sentences  can  be  assigned  multiple  meanings  leads  to 
the  related  phenomenon  of  distinct  inputs  that  should  be  assigned  the  same 
meaning  representation.  Consider  the  following  alternative  ways  of  express- 
ing Example  14.1. 

(14.4)  Does  Maharani  have  vegetarian  dishes? 

(14.5)  Do  they  have  vegetarian  food  at  Maharani? 

(14.6)  Are  vegetarian  dishes  served  at  Maharani? 

(14.7)  Does  Maharani  serve  vegetarian  fare? 

2 This does not foreclose the use of intermediate semantic representations that maintain
some level of ambiguity on the way to a single unambiguous form. Examples of such
representations will be discussed in Chapter 15.

Given that these alternatives use different words and have widely varying
syntactic analyses, it would not be unreasonable to expect them to have
substantially  different  meaning  representations.  Such  a situation  would, 
however,  have  undesirable  consequences  for  our  matching  approach  to  de- 
termining the  truth  of  our  representations.  If  the  system's  knowledge  base 
contains  only  a single  representation  of  the  fact  in  question,  then  the  rep- 
resentations underlying  all  but  one  of  our  alternatives  will  fail  to  produce  a 
match.  We  could,  of  course,  store  all  possible  alternative  representations  of 
the  same  fact  in  the  knowledge  base,  but  this  would  lead  to  an  enormous 
number  of  problems  related  to  keeping  such  a knowledge  base  consistent. 

The  way  out  of  this  dilemma  is  motivated  by  the  fact  that  since  the  an- 
swers given  for  each  of  these  alternatives  should  be  the  same  in  all  situations, 
we  might  say  that  they  all  mean  the  same  thing,  at  least  for  the  purposes  of 
giving  restaurant  recommendations.  In  other  words,  at  least  in  this  domain, 
we  can  legitimately  consider  assigning  the  same  meaning  representation  to 
the  propositions  underlying  each  of  these  requests.  Taking  such  an  approach 
would  guarantee  that  our  matching  scheme  for  answering  Yes-No  questions 
will  still  work. 

The  notion  that  inputs  that  mean  the  same  thing  should  have  the  same 
meaning  representation  is  known  as  the  doctrine  of  canonical  form.  This 
approach  greatly  simplifies  various  reasoning  tasks  since  systems  need  only 
deal  with  a single  meaning  representation  for  a potentially  wide  range  of 
expressions. 

Canonical form does, of course, complicate the task of semantic analysis.
To see this, note that the alternatives given above use completely different
words and syntax to refer to vegetarian fare and to what restaurants do
with it. More specifically, to assign the same representation to all of these
requests our system will have to conclude that vegetarian fare, vegetarian
dishes and vegetarian food refer to the same thing in this context, that the
uses here of having and serving are similarly equivalent, and that the different
syntactic parses underlying these requests are all compatible with the
same meaning representation.

Being able to assign the same representation to such diverse inputs is
a tall order. Fortunately there are some systematic meaning relationships
among word senses and among grammatical constructions that can be
exploited to make this task tractable. Consider the issue of the meanings of
the words food, dish and fare in these examples. A little introspection, or a
glance at a dictionary, reveals that these words have a fair number of distinct
uses. Fortunately, it also reveals that there is at least one sense that is shared
among them all. If a system has the ability to choose that shared sense, then
an identical meaning representation can be assigned to the phrases containing
these words.

In general, we say that these words all have various word senses and
that some of the senses are synonymous with one another. The process of
choosing  the  right  sense  in  context  is  called  word  sense  disambiguation, 
or  word  sense  tagging  by  analogy  to  part-of-speech  tagging.  The  topics  of 
synonymy,  sense  tagging,  and  a host  of  other  topics  related  to  word  meanings 
will  be  covered  in  Chapters  16  and  17.  Suffice  it  to  say  here  that  the  fact  that 
inputs  may  use  different  words  does  not  preclude  the  assignment  of  identical 
meanings  to  them. 
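
The effect of canonical form on matching can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the two lookup tables, the function names, and the reduction of Examples 14.4 through 14.7 to (verb, restaurant, phrase) triples are all hypothetical, and the hard work of choosing the right sense in context is simply stipulated by the tables.

```python
# Hypothetical sense tables: each surface word or phrase maps to one canonical symbol.
PREDICATE_SENSES = {"have": "Serves", "serve": "Serves"}
CONCEPT_SENSES = {
    "vegetarian dishes": "VegetarianFood",
    "vegetarian food": "VegetarianFood",
    "vegetarian fare": "VegetarianFood",
}

# The knowledge base stores the fact exactly once, in canonical form.
KNOWLEDGE_BASE = {("Serves", "Maharani", "VegetarianFood")}


def canonical_form(verb, restaurant, phrase):
    """Map a (verb, restaurant, phrase) triple onto canonical symbols."""
    return (PREDICATE_SENSES[verb], restaurant, CONCEPT_SENSES[phrase])


def answer_yes_no(verb, restaurant, phrase):
    """Answer a Yes-No question by matching its canonical form against the knowledge base."""
    return canonical_form(verb, restaurant, phrase) in KNOWLEDGE_BASE


# Examples 14.4 through 14.7, reduced by hand to triples (an assumption of this sketch).
questions = [
    ("have", "Maharani", "vegetarian dishes"),
    ("have", "Maharani", "vegetarian food"),
    ("serve", "Maharani", "vegetarian dishes"),
    ("serve", "Maharani", "vegetarian fare"),
]
answers = [answer_yes_no(*q) for q in questions]
```

Because all four questions reduce to the same triple, a single stored fact is enough to answer each of them.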

Just as there are systematic relationships among the meanings of different
words, there are similar relationships related to the role that syntactic
analyses play in assigning meanings to sentences. Specifically, alternative
syntactic analyses often have meanings that are, if not identical, at least
systematically related to one another. Consider the following pair of examples.

(14.8)  Maharani  serves  vegetarian  dishes. 

(14.9)  Vegetarian dishes are served by Maharani.

Despite the different placement of the arguments to serve in these examples,
we can still assign Maharani and vegetarian dishes to the same roles in both
of these examples because of our knowledge of the relationship between active
and passive sentence constructions. In particular, we can use knowledge
of where grammatical subjects and direct objects appear in these constructions
to assign Maharani to the role of the server, and vegetarian dishes to
the role of thing being served in both of these examples, despite the fact that
they appear in different surface locations. The precise role of the grammar in
the construction of meaning representations will be covered in Chapter 15.
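
As a rough illustration of this active/passive relationship, the following Python sketch assigns the same two roles to Examples 14.8 and 14.9. The function name, the role labels Server and Served, and the assumption that a parser has already identified the subject, the other noun phrase, and the voice are all hypothetical.

```python
def assign_roles(subject, other_np, voice):
    """Assign the roles of 'serve' from grammatical positions.

    `other_np` is the direct object in an active clause and the NP inside
    the by-phrase in a passive clause (a simplifying assumption).
    """
    if voice == "active":
        return {"Server": subject, "Served": other_np}
    if voice == "passive":
        return {"Server": other_np, "Served": subject}
    raise ValueError("voice must be 'active' or 'passive'")


# (14.8) Maharani serves vegetarian dishes.
active_roles = assign_roles("Maharani", "vegetarian dishes", "active")
# (14.9) Vegetarian dishes are served by Maharani.
passive_roles = assign_roles("vegetarian dishes", "Maharani", "passive")
```

Both calls yield the same role assignment, which is what allows the two sentences to share a meaning representation.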

Inference  and  Variables 

Continuing  with  the  topic  of  the  computational  purposes  that  meaning  rep- 
resentations should  serve,  we  should  consider  more  complex  requests  such 
as  the  following. 

(14.10)  Can  vegetarians  eat  at  Maharani? 

Here,  it  would  be  a mistake  to  invoke  canonical  form  to  force  our  system  to 
assign  the  same  representation  to  this  request  as  for  the  previous  examples. 
The fact that this request results in the same answer as the others arises not
because they mean the same thing, but because there is a commonsense
connection between what vegetarians eat and what vegetarian restaurants serve.
This is a fact about the world and not a fact about any particular kind of
linguistic regularity. This implies that no approach based on canonical form
and simple matching will give us an appropriate answer to this request. What
is needed is a systematic way to connect the meaning representation of this
request with the facts about the world as they are represented in a knowledge
base.

We will use the term inference to refer generically to a system's ability
to draw valid conclusions based on the meaning representation of inputs
and its store of background knowledge. It must be possible for the system
to draw conclusions about the truth of propositions that are not explicitly
represented in the knowledge base, but are nevertheless logically derivable
from the propositions that are present.
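
A minimal sketch of this idea in Python follows. The single rule used here (if a restaurant serves vegetarian food, then vegetarians can eat there) and the predicate name CanEatAt are assumptions made for illustration; they are one possible encoding of the commonsense connection behind Example 14.10, not a claim about how a real system would state it.

```python
# Facts explicitly present in the knowledge base.
FACTS = {("Serves", "Maharani", "VegetarianFood")}


def apply_rule(facts):
    """Apply one hypothetical commonsense rule to every matching fact.

    Rule (an assumption of this sketch):
        Serves(r, VegetarianFood) implies CanEatAt(Vegetarians, r)
    """
    derived = {
        ("CanEatAt", "Vegetarians", restaurant)
        for (predicate, restaurant, food) in facts
        if predicate == "Serves" and food == "VegetarianFood"
    }
    return facts | derived


def holds(query, facts):
    """A query holds if it is stored explicitly or derivable by the rule."""
    return query in apply_rule(facts)


query = ("CanEatAt", "Vegetarians", "Maharani")
stored_explicitly = query in FACTS
derivable = holds(query, FACTS)
```

The query is not stored in the knowledge base, yet it can be answered because it follows from a stored fact together with the rule.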

Now  consider  the  following  somewhat  more  complex  request. 

(14.11)  I’d  like  to  find  a restaurant  where  I can  get  vegetarian  food. 

Unlike  our  previous  examples,  this  request  does  not  make  reference  to  any 
particular  restaurant.  The  user  is  stating  that  they  would  like  information 
about  an  unknown  and  unnamed  entity  that  is  a restaurant  that  serves  veg- 
etarian food.  Since  this  request  does  not  mention  any  particular  restaurant, 
the  kind  of  simple  matching-based  approach  we  have  been  advocating  is  not 
going  to  work.  Rather,  answering  this  request  requires  a more  complex  kind 
of  matching  that  involves  the  use  of  variables.  We  can  gloss  a representation 
containing  such  variables  as  follows. 

Serves(x, VegetarianFood)

Matching  such  a proposition  succeeds  only  if  the  variable  x can  be  re- 
placed by  some  known  object  in  the  knowledge  base  in  such  a way  that  the 
entire  proposition  will  then  match.  The  concept  that  is  substituted  for  the 
variable  can  then  be  used  to  fulfill  the  user’s  request.  Of  course,  this  simple 
example  only  hints  at  the  issues  involved  in  the  use  of  such  variables.  Suffice 
it  to  say  that  linguistic  inputs  contain  many  instances  of  all  kinds  of  indef- 
inite references  and  it  is  therefore  critical  for  any  meaning  representation 
language  to  be  able  to  handle  this  kind  of  expression. 
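
The kind of matching described here can be sketched as follows. The tuple encoding of propositions, the convention that lower-case strings are variables, and the second knowledge base entry are assumptions of the sketch.

```python
# Illustrative knowledge base; propositions are (Predicate, argument, ...) tuples.
KNOWLEDGE_BASE = [
    ("Serves", "Maharani", "VegetarianFood"),
    ("Serves", "AyCaramba", "MexicanFood"),
]


def is_variable(term):
    """Convention assumed here: variables are lower-case, constants are capitalized."""
    return term.islower()


def match(pattern, fact):
    """Return variable bindings that make `pattern` equal to `fact`, or None."""
    if len(pattern) != len(fact):
        return None
    bindings = {}
    for p, f in zip(pattern, fact):
        if is_variable(p):
            # A variable must be bound to the same object everywhere it occurs.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def find_all(pattern):
    """Collect the bindings for every fact in the knowledge base that matches."""
    results = []
    for fact in KNOWLEDGE_BASE:
        bindings = match(pattern, fact)
        if bindings is not None:
            results.append(bindings)
    return results


matches = find_all(("Serves", "x", "VegetarianFood"))
```

Each binding returned for x names an object that can be offered in response to the request in Example 14.11.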

Expressiveness 

Finally, to be useful a meaning representation scheme must be expressive
enough to handle an extremely wide range of subject matter. The ideal
situation, of course, would be to have a single meaning representation
language that could adequately represent the meaning of any sensible natural
language utterance. Although this is probably too much to expect from any
single representational system, Section 14.3 will show that First Order
Predicate Calculus is expressive enough to handle quite a lot of what needs to be
represented.

The previous section focused on some of the purposes that meaning
representations must serve, without saying much about what we will call the
meaning structure of language. By this, we have in mind the various methods
by which human languages convey meaning. These include a variety of
conventional form-meaning associations, word-order regularities, tense
systems, conjunctions and quantifiers, and a fundamental predicate-argument
structure. The remainder of this section focuses exclusively on this last
notion of a predicate-argument structure, which is the mechanism that has had
the greatest practical influence on the nature of meaning representation
languages. The remaining topics will be addressed in Chapter 15 where the
primary focus will be on how they contribute to how meaning representations
are assembled, rather than on the nature of the representations.


Predicate-Argument  Structure 

It appears to be the case that all human languages have a form of
predicate-argument arrangement at the core of their semantic structure. To a first
approximation, this predicate-argument structure asserts that specific
relationships hold among the various concepts underlying the constituent words and
phrases  that  make  up  sentences.  It  is  largely  this  underlying  structure  that 
permits  the  creation  of  a single  composite  meaning  representation  from  the 
meanings  of  the  various  parts  of  an  input.  One  of  the  most  important  jobs 
of  a grammar  is  to  help  organize  this  predicate-argument  structure.  Corre- 
spondingly, it  is  critical  that  our  meaning  representation  languages  support 
the  predicate-argument  structures  presented  to  us  by  language. 

We  have  already  seen  the  beginnings  of  this  concept  in  our  discus- 
sion of  verb  complements  in  Chapter  9 and  Chapter  11.  There  we  saw  that 
verbs  dictate  specific  constraints  on  the  number,  grammatical  category,  and 
location  of  the  phrases  that  are  expected  to  accompany  them  in  syntactic 
structures. To briefly review this idea, consider the following examples.

(14.12)  I want Italian food.

(14.13)  I want  to  spend  less  than  five  dollars. 

(14.14)  I want  it  to  be  close  by  here. 

These  examples  can  be  classified  as  having  one  of  the  following  three  syn- 
tactic argument  frames. 

NP want NP
NP want Inf-VP
NP want NP Inf-VP

These syntactic frames specify the number, position and syntactic
category of the arguments that are expected to accompany a verb. For example,
the frame for the variety of want that appears in Example 14.12 specifies the
following facts:

• There  are  two  arguments  to  this  predicate. 

• Both  arguments  must  be  NPs. 

• The  first  argument  is  pre-verbal  and  plays  the  role  of  the  subject. 

• The  second  argument  is  post-verbal  and  plays  the  role  of  the  direct 
object. 

As  we  have  shown  in  previous  chapters,  this  kind  of  information  is  quite 
valuable  in  capturing  a variety  of  important  facts  about  syntax.  By  analyzing 
easily  observable  semantic  information  associated  with  these  frames,  we  can 
also  gain  considerable  insight  into  our  meaning  representations.  We  will 
begin  by  considering  two  extensions  of  these  frames  into  the  semantic  realm: 
semantic  roles  and  semantic  restrictions  on  these  roles. 

The  notion  of  a semantic  role  can  be  understood  by  looking  at  the  sim- 
ilarities among  the  arguments  in  Examples  14.12  through  14.14.  In  each  of 
these cases, the pre-verbal argument always plays the role of the entity
doing the wanting, while the post-verbal argument plays the role of the concept
that  is  wanted.  By  noticing  these  regularities  and  labeling  them  accordingly, 
we  can  associate  the  surface  arguments  of  a verb  with  a set  of  discrete  roles 
in  its  underlying  semantics.  More  generally,  we  can  say  that  verb  subcatego- 
rization frames  allow  the  linking  of  arguments  in  the  surface  structure  with 
the  semantic  roles  these  arguments  play  in  the  underlying  semantic  repre- 
sentation of  an  input.  The  study  of  roles  associated  with  specific  verbs  and 
across  classes  of  verbs  is  usually  referred  to  as  thematic  role  or  case  role 
analysis  and  will  be  studied  in  more  detail  in  Section  14.4  and  Chapter  16. 

The notion of semantic restrictions arises directly from these semantic
roles. Returning to Examples 14.12 through 14.14, we can see that it is not
merely the case that each initial noun phrase argument will be the wanter
but that only certain kinds, or categories, of concepts can play the role of
wanter in any straightforward manner. Specifically, want restricts the
constituents appearing as the first argument to those whose underlying concepts
can actually partake in a wanting. Traditionally, this notion is referred to as
a selection restriction. Through the use of these selection restrictions, verbs
can specify semantic restrictions on their arguments.
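
To make the combination of semantic roles and selection restrictions concrete, here is a small Python sketch. The role names Wanter and Wanted, the category label Animate, and the category table are hypothetical choices made for illustration only.

```python
# Hypothetical category assignments for a few concepts.
CATEGORIES = {
    "Speaker": "Animate",
    "ItalianFood": "Food",
    "Maharani": "Restaurant",
}

# A frame for 'want': ordered semantic roles plus restrictions on their fillers.
WANT_FRAME = {
    "predicate": "Want",
    "roles": ["Wanter", "Wanted"],
    "restrictions": {"Wanter": "Animate"},  # assumed restriction on the first argument
}


def fill_frame(frame, arguments):
    """Link surface arguments to roles; return None if a restriction is violated."""
    if len(arguments) != len(frame["roles"]):
        return None
    filled = dict(zip(frame["roles"], arguments))
    for role, required_category in frame["restrictions"].items():
        if CATEGORIES.get(filled[role]) != required_category:
            return None
    return filled


# (14.12) I want Italian food.
accepted = fill_frame(WANT_FRAME, ["Speaker", "ItalianFood"])
# Reversing the arguments puts a Food concept in the Wanter role.
rejected = fill_frame(WANT_FRAME, ["ItalianFood", "Speaker"])
```

The first call links each argument to its role; the second fails because the filler of the Wanter role does not belong to the required category.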

Before leaving this topic, we should note that verbs are by no means
the only objects in a grammar that can carry a predicate-argument structure.
Consider the following phrase from the BERP corpus.

(14.15)  an  Italian  restaurant  under  fifteen  dollars 

In  this  example,  the  meaning  representation  associated  with  the  preposition 
under  can  be  seen  as  having  something  like  the  following  structure. 

Under(ItalianRestaurant, $15)

In  other  words,  prepositions  can  be  characterized  as  two-argument  predicates 
where  the  first  argument  is  an  object  that  is  being  placed  in  some  relation  to 
the  second  argument. 

Another  non-verb  based  predicate-argument  structure  is  illustrated  in 
the  following  example. 

(14.16)  make  a reservation  for  this  evening  for  a table  for  two  persons  at  8. 

Here,  the  predicate-argument  structure  is  based  on  the  concept  under- 
lying the  noun  reservation,  rather  than  make,  the  main  verb  in  the  phrase. 
This example gives rise to a four-argument predicate structure like the
following.

Reservation(Hearer, Today, 8PM, 2)

This  discussion  makes  it  clear  that  any  useful  meaning  representation 
language  must  be  organized  in  a way  that  supports  the  specification  of  se- 
mantic predicate-argument  structures.  Specifically,  this  support  must  include 
support  for  the  kind  of  semantic  information  that  languages  present: 

• Variable  arity  predicate-argument  structures. 

• The  semantic  labeling  of  arguments  to  predicates. 

• The statement of semantic constraints on the fillers of argument roles.

14.3  First Order Predicate Calculus

First  Order  Predicate  Calculus  (FOPC)  is  a flexible,  well-understood,  and 
computationally  tractable  approach  to  the  representation  of  knowledge  that 
satisfies  many  of  the  requirements  raised  in  Sections  14.1  and  14.2  for  a 
meaning  representation  language.  Specifically,  it  provides  a sound  computa- 
tional basis  for  the  verifiability,  inference,  and  expressiveness  requirements. 
However, the most attractive feature of FOPC is the fact that it makes very
few specific commitments as to how things ought to be represented. As we
will see, the specific commitments it does make are ones that are fairly easy
to live with; the represented world consists of objects, properties of objects,
and relations among objects.

The  remainder  of  this  section  first  provides  an  introduction  to  the  basic 
syntax  and  semantics  of  FOPC  and  then  describes  the  application  of  FOPC 
to  a number  of  linguistically  relevant  topics.  Section  14.6  then  discusses 
the  connections  between  FOPC  and  some  of  the  other  representations  shown 
earlier  in  Figure  14.1. 

Elements  of  FOPC 

We  will  explore  FOPC  in  a bottom-up  fashion  by  first  examining  its 
various  atomic  elements  and  then  showing  how  they  can  be  composed  to 
create  larger  meaning  representations.  Figure  14.2,  which  provides  a com- 
plete context-free  grammar  for  the  particular  syntax  of  FOPC  that  we  will  be 
using,  will  be  our  roadmap  for  this  section. 

Let’s  begin  by  examining  the  notion  of  a Term,  the  FOPC  device  for 
representing  objects.  As  can  be  seen  from  Figure  14.2,  FOPC  provides  three 
ways  to  represent  these  basic  building  blocks:  constants,  functions,  and  vari- 
ables. Each  of  these  devices  can  be  thought  of  as  a way  of  naming,  or  point- 
ing to,  an  object  in  the  world  under  consideration. 

Constants in FOPC refer to specific objects in the world being
described. Such constants are conventionally depicted as either single
capitalized letters such as A and B or single capitalized words that are often
reminiscent of proper nouns such as Maharani and Harry. Like programming
language constants, FOPC constants refer to exactly one object. Objects can,
however, have multiple constants that refer to them.

Formula       → AtomicFormula
              | Formula Connective Formula
              | Quantifier Variable,... Formula
              | ¬ Formula
              | (Formula)
AtomicFormula → Predicate(Term,...)
Term          → Function(Term,...)
              | Constant
              | Variable
Connective    → ∧ | ∨ | ⇒
Quantifier    → ∀ | ∃
Constant      → A | VegetarianFood | Maharani ···
Variable      → x | y | ···
Predicate     → Serves | Near | ···
Function      → LocationOf | CuisineOf | ···

Figure 14.2  A context-free grammar specification of the syntax of First
Order Predicate Calculus representations. (Adapted from Russell and Norvig
(1995).)

Functions in FOPC correspond to concepts that are often expressed
in English as genitives such as the location of Maharani or
Maharani’s location. A FOPC translation of such an expression might look like
the following.

LocationOf(Maharani)

FOPC functions are syntactically the same as single argument predicates. It
is important to remember, however, that while they have the appearance of
predicates they are in fact Terms in that they refer to unique objects.
Functions provide a convenient way to refer to specific objects without having
to associate a named constant with them. This is particularly convenient in
cases where many named objects, like restaurants, will have a unique
concept such as a location associated with them.

The notion of a variable is our final FOPC mechanism for referring to
objects. Variables, which are normally depicted as single lower-case letters,
give us the ability to make assertions and draw inferences about objects
without having to make reference to any particular named object. This ability to
make statements about anonymous objects comes in two flavors: making
statements about a particular unknown object and making statements about
all the objects in some arbitrary world of objects. We will return to the topic
of variables after we have presented quantifiers, the elements of FOPC that
will make them useful.

Now that we have the means to refer to objects, we can move on to the
FOPC mechanisms that are used to state relations that hold among objects.

As one might guess from its name, FOPC is organized around the notion of
the predicate. Predicates are symbols that refer to, or name, the relations that
hold among some fixed number of objects in a given domain. Returning to
the example introduced informally in Section 14.1, a reasonable FOPC
representation for Maharani serves vegetarian food might look like the following
formula.

Serves(Maharani, VegetarianFood)

This FOPC sentence asserts that Serves, a two-place predicate, holds between
the objects denoted by the constants Maharani and VegetarianFood.

A somewhat  different  use  of  predicates  is  illustrated  by  the  following 
typical  representation  for  a sentence  like  Maharani  is  a restaurant. 

Restaurant(Maharani)

This is an example of a one-place predicate that is used, not to relate multiple
objects,  but  rather  to  assert  a property  of  a single  object.  In  this  case,  it 
encodes  the  category  membership  of  Maharani.  We  should  note  that  while 
this  is  a commonplace  way  to  deal  with  categories  it  is  probably  not  the 
most  useful.  Section  14.4  will  return  to  the  topic  of  the  representation  of 
categories. 

With  the  ability  to  refer  to  objects,  to  assert  facts  about  objects,  and 
to  relate  objects  to  one  another,  we  have  the  ability  to  create  rudimentary 
composite  representations.  These  representations  correspond  to  the  atomic 
formula  level  in  Figure  14.2.  Recall  that  this  ability  to  create  composite 
meaning  representations  was  one  of  the  core  components  of  the  meaning 
structure  of  language  described  in  Section  14.2. 

This ability to compose complex representations is not limited to the
use of single predicates. Larger composite representations can also be put
together through the use of logical connectives. As can be seen from Figure
14.2, logical connectives give us the ability to create larger representations
by conjoining logical formulas using one of three operators. Consider, for
example, the following BERP sentence and one possible representation for it.

(14.17)  I only  have  five  dollars  and  I don’t  have  a lot  of  time. 

Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)

The semantic representation for this example is built up in a straightforward
way from the semantics of the individual clauses through the use of the ∧ and
¬ operators. Note that the recursive nature of the grammar in Figure 14.2
allows an infinite number of logical formulas to be created through the use
of these connectives. Thus as with syntax, we have the ability to create an
infinite number of representations using a finite device.
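
One way to see this recursion at work is to build the representation of Example 14.17 as a nested data structure. The following Python sketch covers only atomic formulas, negation, and conjunction; the class names and the render function are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """AtomicFormula: a predicate applied to a tuple of terms (strings here)."""
    predicate: str
    terms: tuple


@dataclass(frozen=True)
class Not:
    """Negation of a formula."""
    operand: object


@dataclass(frozen=True)
class And:
    """Conjunction of two formulas."""
    left: object
    right: object


def render(formula):
    """Turn a formula object back into the notation used in the text."""
    if isinstance(formula, Atom):
        return f"{formula.predicate}({', '.join(formula.terms)})"
    if isinstance(formula, Not):
        inner = render(formula.operand)
        # Parenthesize anything larger than an atomic formula.
        return "¬" + (inner if isinstance(formula.operand, Atom) else f"({inner})")
    if isinstance(formula, And):
        return f"{render(formula.left)} ∧ {render(formula.right)}"
    raise TypeError(f"not a formula: {formula!r}")


# (14.17) I only have five dollars and I don't have a lot of time.
example_14_17 = And(
    Atom("Have", ("Speaker", "FiveDollars")),
    Not(Atom("Have", ("Speaker", "LotOfTime"))),
)
rendered = render(example_14_17)
```

Because And and Not accept any formula as an operand, formulas of arbitrary depth can be built from these three finite definitions.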

The  Semantics  of  FOPC 

The various objects, properties, and relations represented in a FOPC knowledge
base acquire their meanings by virtue of their correspondence to objects,
properties, and relations out in the external world being modeled by
the knowledge base. FOPC sentences can, therefore, be assigned a value of
True or False based on whether the propositions they encode are in accord
with the world or not.

Consider  the  following  example. 

(14.18)  Ay  Caramba  is  near  ICSI. 

Capturing  the  meaning  of  this  example  in  FOPC  involves  identifying  the 
Terms  and  Predicates  that  correspond  to  the  various  grammatical  elements 
in  the  sentence,  and  creating  logical  formulas  that  capture  the  relations  im- 
plied by  the  words  and  syntax  of  the  sentence.  For  this  example,  such  an 
effort  might  yield  something  like  the  following. 

Near(LocationOf(AyCaramba), LocationOf(ICSI))

The meaning of this logical formula then arises from the relationship
between the terms LocationOf(AyCaramba), LocationOf(ICSI), the predicate
Near, and the objects and relation they correspond to in the world being
modeled. Specifically, this sentence can be assigned a value of True or False
based on whether or not the real Ay Caramba is actually close to ICSI.
Of course, since our computers rarely have direct access to the outside world
we have to rely on some other means to determine the truth of formulas like
this one.

For our current purposes, we will adopt what is known as a database
semantics for determining the truth of our logical formulas. Operationally,
atomic formulas are taken to be true if they are literally present in the knowledge
base or if they can be inferred from other formulas that are in the knowledge
base. The interpretation of formulas involving logical connectives is
based on the meanings of the components in the formulas combined with the
meanings of the connectives they contain. Figure 14.3 gives interpretations
for each of the logical operators shown in Figure 14.2.


P      Q      ¬P     P ∧ Q  P ∨ Q  P ⇒ Q
False  False  True   False  False  True
False  True   True   False  True   True
True   False  False  False  True   False
True   True   False  True   True   True

Figure 14.3  Truth table giving the semantics of the various logical
connectives.

The semantics of the ∧ (and), and ¬ (not) operators are fairly
straightforward, and are correlated with at least some of the senses of their
corresponding English terms. However, it is worth pointing out that the ∨ (or)
operator is not disjunctive in the same way that the corresponding English
word is, and that the ⇒ (implies) operator is only loosely based on any
commonsense notions of implication or causation. As we will see in more
detail in Section 14.4, in most cases it is safest to rely directly on the
entries in the truth table, rather than on intuitions arising from the names of the
operators.
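
The database semantics and the truth table can be combined into a short evaluator, sketched below in Python. The nested-tuple encoding of formulas and the operator names are assumptions of the sketch, and for brevity an atomic formula counts as true only if it is literally present in the knowledge base; the inference half of the definition is left out.

```python
# Illustrative knowledge base of atomic formulas.
KNOWLEDGE_BASE = {("Have", "Speaker", "FiveDollars")}


def evaluate(formula):
    """Evaluate a formula encoded as nested tuples.

    ("not", f), ("and", f, g), ("or", f, g) and ("implies", f, g) are
    connectives; any other tuple is treated as an atomic formula.
    """
    operator = formula[0]
    if operator == "not":
        return not evaluate(formula[1])
    if operator == "and":
        return evaluate(formula[1]) and evaluate(formula[2])
    if operator == "or":
        return evaluate(formula[1]) or evaluate(formula[2])
    if operator == "implies":
        # Matches the P => Q column: false only when P is true and Q is false.
        return (not evaluate(formula[1])) or evaluate(formula[2])
    # Atomic formula: true only if literally present (inference is omitted here).
    return formula in KNOWLEDGE_BASE


has_money = ("Have", "Speaker", "FiveDollars")
has_time = ("Have", "Speaker", "LotOfTime")

example_14_17 = ("and", has_money, ("not", has_time))
result = evaluate(example_14_17)
```

Note that an implication with a false antecedent, such as ("implies", has_time, has_money), evaluates to True, exactly as the truth table dictates and regardless of any intuitive connection between the two clauses.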

Variables  and  Quantifiers 

We  now  have  all  the  machinery  necessary  to  return  to  our  earlier  discussion 
of variables. As noted above, variables are used in two ways in FOPC: to refer
to particular anonymous objects and to refer generically to all objects in
a collection. These two uses are made possible through the use of operators
known as quantifiers. The two operators that are basic to FOPC are the
existential quantifier, which is denoted ∃, and is pronounced as “there exists”,
and the universal quantifier, which is denoted ∀, and is pronounced as “for
all”.

The  need  for  an  existentially  quantified  variable  is  often  signaled  by 
the  presence  of  an  indefinite  noun  phrase  in  English.  Consider  the  following 
example. 

(14.19)  a restaurant that serves Mexican food near ICSI.

Here reference is being made to an anonymous object of a specified category
with particular properties. The following would be a reasonable representation
of the meaning of such a phrase.

∃x Restaurant(x)
   ∧ Serves(x, MexicanFood)
   ∧ Near(LocationOf(x), LocationOf(ICSI))

The  existential  quantifier  at  the  head  of  this  sentence  instructs  us  on 
how  to  interpret  the  variable  x in  the  context  of  this  sentence.  Informally,  it 
says  that  for  this  sentence  to  be  true  there  must  be  at  least  one  object  such 
that  if  we  were  to  substitute  it  for  the  variable  x,  the  resulting  sentence  would 
be  true.  For  example,  if  AyCaramba  is  a Mexican  restaurant  near  ICSI,  then 
substituting  AyCaramba  for  x results  in  the  following  logical  formula. 

Restaurant(AyCaramba)
   ∧ Serves(AyCaramba, MexicanFood)
   ∧ Near(LocationOf(AyCaramba), LocationOf(ICSI))

Based on the semantics of the ∧ operator, this sentence will be true if
all of its three component atomic formulas are true. These in turn will be true
if they are either present in the system’s knowledge base or can be inferred
from other facts in the knowledge base.

The use of the universal quantifier also has an interpretation based on
substitution of known objects for variables. The substitution semantics for
the universal quantifier takes the expression for all quite literally; the ∀
operator states that for the logical formula in question to be true the substitution
of any object in the knowledge base for the universally quantified variable
should result in a true formula. This is in marked contrast to the ∃ operator,
which only insists on a single valid substitution for the sentence to be true.

Consider  the  following  example. 

(14.20)  All  vegetarian  restaurants  serve  vegetarian  food. 

A reasonable  representation  for  this  sentence  would  be  something  like  the 
following. 

∀x VegetarianRestaurant(x) ⇒ Serves(x, VegetarianFood)

For this sentence to be true, it must be the case that every substitution of a
known object for x results in a sentence that is true. We can divide up the
set of all possible substitutions into the set of objects consisting of vegetarian
restaurants and the set consisting of everything else. Let us first consider the
case where the substituted object actually is a vegetarian restaurant; one such
substitution would result in the following sentence.

VegetarianRestaurant(Maharani)
   ⇒ Serves(Maharani, VegetarianFood)

If we assume that we know that the consequent clause,

Serves(Maharani, VegetarianFood)

is true then this sentence as a whole must be true. Both the antecedent and
the consequent have the value True and, therefore, according to the first
two rows of Table 14.3 the sentence itself can have the value True. This
result will, of course, be the same for all possible substitutions of Terms
representing vegetarian restaurants for x.

Remember, however, that for this sentence to be true it must be true
for all possible substitutions. What happens when we consider a substitution
from the set of objects that are not vegetarian restaurants? Consider
the substitution of a non-vegetarian restaurant such as Ay Caramba’s for the
variable x.

VegetarianRestaurant(AyCaramba)
   ⇒ Serves(AyCaramba, VegetarianFood)

Since the antecedent of the implication is False, we can determine
from Table 14.3 that the sentence is always True, again satisfying the ∀
constraint.

Note that it may still be the case that Ay Caramba serves vegetarian
food without actually being a vegetarian restaurant. Note also that, despite
our choice of examples, there are no implied categorical restrictions on the
objects that can be substituted for x by this kind of reasoning. In other words,
there is no restriction of x to restaurants or concepts related to them.
Consider the following substitution.

VegetarianRestaurant(Carburetor)
   ⇒ Serves(Carburetor, VegetarianFood)

Here  the  antecedent  is  still  false  and  hence  the  rule  remains  true  under  this 
kind  of  irrelevant  substitution. 
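The substitution semantics worked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the knowledge base is a small hand-written set of ground facts, the domain is a hand-picked list of constants, any atom absent from the knowledge base is treated as false (no further inference is attempted), and the helper names `holds`, `implies`, `exists`, and `forall` are ours rather than part of FOPC. The Near clause of the restaurant example is omitted for brevity.

```python
# A toy knowledge base of ground atomic facts (hypothetical contents).
KB = {
    ("Restaurant", "AyCaramba"),
    ("Serves", "AyCaramba", "MexicanFood"),
    ("VegetarianRestaurant", "Maharani"),
    ("Serves", "Maharani", "VegetarianFood"),
}

# Objects available for substitution; Carburetor is deliberately irrelevant.
DOMAIN = ["AyCaramba", "Maharani", "Carburetor"]


def holds(*atom):
    """An atomic formula counts as true if it is present in the knowledge base."""
    return tuple(atom) in KB


def implies(p, q):
    """Truth table for implication: false only when p is true and q is false."""
    return (not p) or q


def exists(formula):
    """True if at least one substitution from DOMAIN makes the formula true."""
    return any(formula(x) for x in DOMAIN)


def forall(formula):
    """True only if every substitution from DOMAIN makes the formula true."""
    return all(formula(x) for x in DOMAIN)


# "There is a restaurant that serves Mexican food."
some_mexican = exists(
    lambda x: holds("Restaurant", x) and holds("Serves", x, "MexicanFood")
)

# "All vegetarian restaurants serve vegetarian food."
veg_rule = forall(
    lambda x: implies(holds("VegetarianRestaurant", x),
                      holds("Serves", x, "VegetarianFood"))
)
```

Substituting Carburetor makes the antecedent of the rule false, so `implies` returns true for it and the universally quantified rule still holds over the whole domain.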

To review, variables in logical formulas must be either existentially (∃)
or universally (∀) quantified. To satisfy an existentially quantified variable,
there must be at least one substitution that results in a true sentence.
Sentences with universally quantified variables must be true under all possible
substitutions.

Inference

One of the most important desiderata given in Section 14.1 for a meaning
representation language is that it should support inference: the ability to
add valid new propositions to a knowledge base, or to determine the truth
of propositions not explicitly contained within a knowledge base. This
section briefly discusses modus ponens, the most important inference method
provided by FOPC. Applications of modus ponens will be discussed in
Chapter 18.

Modus ponens is a familiar form of inference that corresponds to what
is informally known as if-then reasoning. We can abstractly define modus
ponens as follows, where α and β should be taken as FOPC formulas.

α
α ⇒ β
-----
β

In  general,  schemas  like  this  indicate  that  the  formula  below  the  line  can 
be  inferred  from  the  formulas  above  the  line  by  some  form  of  inference. 
Modus  ponens  simply  states  that  if  the  left-hand  side  of  an  implication  rule 
is  present  in  the  knowledge  base,  then  the  right-hand  side  of  the  rule  can  be 
inferred. In the following discussions, we will refer to the left-hand side of
an  implication  as  the  antecedent,  and  the  right-hand  side  as  the  consequent. 

As an example of a typical use of modus ponens, consider the following,
which uses a rule from the last section.

(14.21)
VegetarianRestaurant(Rudys)
∀x VegetarianRestaurant(x) ⇒ Serves(x, VegetarianFood)
------------------------------------------------------
Serves(Rudys, VegetarianFood)

Here, the formula VegetarianRestaurant(Rudys) matches the antecedent
of the rule, thus allowing us to use modus ponens to conclude
Serves(Rudys, VegetarianFood).

Modus ponens is typically put to practical use in one of two ways: forward
chaining and backward chaining. In forward chaining systems, modus
ponens is used in precisely the manner just described. As individual facts are
added to the knowledge base, modus ponens is used to fire all applicable
implication rules. In this kind of arrangement, as soon as a new fact is added to
the knowledge base, all applicable implication rules are found and applied,
each resulting in the addition of new facts to the knowledge base. These new
propositions in turn can be used to fire implication rules applicable to them.
The process continues until no further facts can be deduced.

The forward chaining approach has the advantage that facts will be
present in the knowledge base when needed, since in a sense all inference
is performed in advance. This can substantially reduce the time needed to
answer subsequent queries since they should all amount to simple lookups.
The disadvantage of this approach is that facts may be inferred and stored
that will never be needed. Production systems, which are heavily used
in cognitive modeling work, are forward chaining inference systems
augmented with additional control knowledge that governs which rules are to be
fired.
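A minimal sketch of forward chaining, assuming a deliberately restricted rule format: each rule stands for a formula of the form ∀x P(x) ⇒ Q(x, c1, ...), with a unary antecedent and constant extra arguments in the consequent. The tuple encoding and the name `forward_chain` are illustrative choices, not a description of any particular production system.

```python
# Each rule is (antecedent_predicate, consequent_predicate, extra_constants),
# read as: for all x, antecedent(x) => consequent(x, *extra_constants).
RULES = [
    ("VegetarianRestaurant", "Serves", ("VegetarianFood",)),
]


def forward_chain(facts, rules):
    """Apply modus ponens repeatedly until no new facts can be deduced."""
    known = set(facts)
    agenda = list(known)          # facts whose consequences are still unexplored
    while agenda:
        fact = agenda.pop()
        pred, args = fact[0], fact[1:]
        for antecedent, consequent, extra in rules:
            # Only unary antecedents are handled in this sketch.
            if pred == antecedent and len(args) == 1:
                new_fact = (consequent, args[0], *extra)
                if new_fact not in known:
                    known.add(new_fact)
                    agenda.append(new_fact)   # new facts may fire further rules
    return known


closure = forward_chain({("VegetarianRestaurant", "Rudys")}, RULES)
```

After the call, a later query about what Rudys serves reduces to a membership test on `closure`, which is the lookup behavior described above; the cost is that every derivable fact is stored whether or not it is ever asked for.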

In backward chaining, modus ponens is run in reverse to prove
specific propositions, called queries. The first step is to see if the query formula
is true by determining if it is present in the knowledge base. If it is not,
then the next step is to search for applicable implication rules present in the
knowledge base. An applicable rule is one where the consequent of the rule
matches the query formula. If there are any such rules, then the query
can be proved if the antecedent of any one of them can be shown to be true.
Not surprisingly, this can be performed recursively by backward chaining
on the antecedent as a new query. The Prolog programming language is a
backward chaining system that implements this strategy.

To see how this works, let’s assume that we have been asked to verify
the truth of the proposition Serves(Rudys, VegetarianFood), assuming the
facts given above the line in 14.21. Since it is not present in the knowledge
base, a search for an applicable rule is initiated that results in the rule given
above. After substituting the constant Rudys for the variable x, our next task
is to prove the antecedent of the rule, VegetarianRestaurant(Rudys), which
of course is one of the facts we are given.
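The three steps just described (look up the query, find a rule whose consequent matches, recursively prove its antecedent) can be sketched as follows. The sketch assumes the same restricted rule format as a formula ∀x P(x) ⇒ Q(x, c1, ...), handles only ground queries, and uses a hypothetical depth limit to stop runaway recursion; it is not meant to reflect how Prolog is actually implemented.

```python
FACTS = {("VegetarianRestaurant", "Rudys")}

# Each rule reads: for all x, antecedent(x) => consequent(x, *extra).
RULES = [
    ("VegetarianRestaurant", "Serves", ("VegetarianFood",)),
]


def prove(query, facts, rules, depth=10):
    """Return True if the ground query can be proved by backward chaining."""
    if query in facts:                 # Step 1: simple lookup.
        return True
    if depth == 0:                     # Guard against runaway recursion.
        return False
    pred, args = query[0], query[1:]
    for antecedent, consequent, extra in rules:
        # Step 2: a rule applies if its consequent matches the query.
        if pred == consequent and len(args) >= 1 and args[1:] == extra:
            # Step 3: substitute the constant for x and prove the antecedent.
            if prove((antecedent, args[0]), facts, rules, depth - 1):
                return True
    return False


result = prove(("Serves", "Rudys", "VegetarianFood"), FACTS, RULES)
```

Unlike forward chaining, nothing is derived or stored until a query arrives, so work is done only for propositions that are actually asked about.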

Note that it is critical to distinguish between reasoning via backward
chaining from queries to known facts, and reasoning backwards from known
consequents to unknown antecedents. To be specific, by reasoning
backwards we mean that if the consequent of a rule is known to be true, we
assume that the antecedent will be as well. For example, let’s assume that we
know that Serves(Rudys, VegetarianFood) is true. Since this fact matches
the consequent of our rule, we might reason backwards to the conclusion
that VegetarianRestaurant(Rudys).

While backward chaining is a sound method of reasoning, reasoning
backwards is an invalid, though frequently useful, form of plausible
reasoning. Plausible reasoning from consequents to antecedents is known as
abduction, and as we will see in Chapter 18 is often useful in accounting for
many of the inferences people make while analyzing extended discourses.

While forward and backward reasoning are sound, neither is complete.
This means that there are valid inferences that cannot be found by
systems using these methods alone. Fortunately, there is an alternative
inference technique called resolution that is sound and complete. Unfortunately,
inference systems based on resolution are far more computationally
expensive than forward or backward chaining systems. In practice, therefore, most
systems use some form of chaining, and place a burden on knowledge base
developers to encode the knowledge in a fashion that permits the necessary
inferences to be drawn.

Some  Linguistically  Relevant  Concepts 

Entire lives have been spent studying the representation of various aspects
of human knowledge. These efforts have ranged from tightly focused
efforts to represent individual domains such as time, to monumental efforts to
encode all of our commonsense knowledge of the world (Lenat and Guha,
1991). Our focus here is considerably more modest. This section provides a
brief overview of the representation of a few important topics that have clear
implications for language processing. Specifically, the following sections
provide introductions to the meaning representations of categories, events,
time, and beliefs.

Categories 

As we noted in Section 14.2, words with predicate-like semantics often
express preferences for the semantics of their arguments in the form of
selection restrictions. These restrictions are typically expressed in the form of
semantically-based categories where all the members of a category share a
set of relevant features.

The  most  common  way  to  represent  categories  is  to  create  a unary 
predicate  for  each  category  of  interest.  Such  predicates  can  then  be  asserted 
for  each  member  of  that  category.  For  example,  in  our  restaurant  discussions 
we  have  been  using  the  unary  predicate  VegetarianRestaurant  as  in: 

VegetarianRestaurant(Maharani)

Similar  logical  formulas  would  be  included  in  our  knowledge  base  for 
each  known  vegetarian  restaurant. 




Unfortunately, in this method categories are relations, rather than
full-fledged objects. It is, therefore, difficult to make assertions about categories
themselves, rather than about their individual members. For example, we
might want to designate the most popular member of a given category as in
the following expression.

MostPopular(Maharani, VegetarianRestaurant)

Unfortunately, this is not a legal FOPC formula since the arguments to
predicates in FOPC must be Terms, not other predicates.

One way to solve this problem is to represent all the concepts that
we want to make statements about as full-fledged objects via a technique
called reification. In this case, we can represent the category of
VegetarianRestaurant as an object just as Maharani is. The notion of
membership in such a category is then denoted via a membership relation as in
the following.

ISA(Maharani, VegetarianRestaurant)

The relation denoted by ISA (is a) holds between objects and the
categories in which they are members. This technique can be extended to create
hierarchies of categories through the use of other similar relations, as in the
following.

AKO(VegetarianRestaurant, Restaurant)

Here, the relation AKO (a kind of) holds between categories and denotes
a category inclusion relationship. Of course, to truly give these predicates
meaning they would have to be situated in a larger set of facts defining
categories as sets.
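One small piece of that larger set of facts can be sketched in Python. The sketch assumes, as an illustration rather than something the formulas above state, that category inclusion is transitive and that membership is inherited upward, i.e. ISA(x, c) ∧ AKO(c, d) ⇒ ISA(x, d). The function names are hypothetical.

```python
# Reified categories: both objects and categories are plain constants.
ISA = {("Maharani", "VegetarianRestaurant")}
AKO = {("VegetarianRestaurant", "Restaurant")}


def supercategories(category, ako=AKO):
    """All categories reachable from `category` by following AKO links."""
    seen = set()
    frontier = [category]
    while frontier:
        current = frontier.pop()
        for sub, sup in ako:
            if sub == current and sup not in seen:
                seen.add(sup)
                frontier.append(sup)
    return seen


def member_of(obj, category, isa=ISA, ako=AKO):
    """True if obj is asserted to be in category or in one of its subcategories."""
    for o, direct in isa:
        if o == obj and (direct == category
                         or category in supercategories(direct, ako)):
            return True
    return False
```

Because VegetarianRestaurant is now an ordinary constant, it can also appear as an argument of other relations, which is exactly what the unary-predicate encoding could not express.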

Chapter  16  discusses  the  practical  use  of  such  relations  in  databases  of 
lexical  relations,  in  the  representation  of  selection  restrictions,  and  in  word 
sense  disambiguation. 

Events 

The representations for events that we have used until now have consisted of
single predicates with as many arguments as are needed to incorporate all the
roles associated with a given example. For example, the representation for
making a reservation discussed in Section 14.2 consisted of a single
predicate with arguments for the person making the reservation, the restaurant,
the day, the time, and the number of people in the party, as in the following.

Reservation(Hearer, Maharani, Today, 8PM, 2)

In the case of verbs, this approach simply assumes that the predicate
representing the meaning of a verb has the same number of arguments as are
present in the verb’s syntactic subcategorization frame.

Unfortunately, there are four problems with this approach that make
it awkward to apply in practice:

• Determining the correct number of roles for any given event.

• Representing facts about the roles associated with an event.

• Ensuring that all the correct inferences can be derived directly from the
representation of an event.

• Ensuring that no incorrect inferences can be derived from the
representation of an event.

We  will  explore  these,  and  other  related  issues,  by  considering  a series 
of  representations  for  events.  This  discussion  will  focus  on  the  following 
examples  of  the  verb  eat. 


(14.22)  I ate. 

(14.23)  I ate  a turkey  sandwich. 

(14.24)  I ate  a turkey  sandwich  at  my  desk. 

(14.25)  I ate  at  my  desk. 

(14.26)  I ate  lunch. 

(14.27)  I ate  a turkey  sandwich  for  lunch. 

(14.28)  I ate  a turkey  sandwich  for  lunch  at  my  desk. 

Clearly, the variable number of arguments for a predicate-bearing verb
like eat poses a tricky problem. While we would like to think that all of these
examples denote the same kind of event, predicates in FOPC have fixed arity;
that is, they take a fixed number of arguments.

One possible solution is suggested by the way that examples like these
are handled syntactically. The solution given in Chapter 11 was to create
one subcategorization frame for each of the configurations of arguments that
a verb allows. The semantic analog to this approach is to create as many
different eating predicates as are needed to handle all of the ways that eat
behaves. Such an approach would yield the following kinds of representations
for Examples 14.22 through 14.28.

Eating₁(Speaker)
Eating₂(Speaker, TurkeySandwich)
Eating₃(Speaker, TurkeySandwich, Desk)
Eating₄(Speaker, Desk)
Eating₅(Speaker, Lunch)
Eating₆(Speaker, TurkeySandwich, Lunch)
Eating₇(Speaker, TurkeySandwich, Lunch, Desk)

This approach simply sidesteps the issue of how many arguments the
Eating predicate should have by creating distinct predicates for each of the
subcategorization frames. Unfortunately, this approach comes at a rather
high cost. Other than the suggestive names of the predicates, there is
nothing to tie these events to one another even though there are obvious logical
relations among them. Specifically, if Example 14.28 is true then all of the
other examples are true as well. Similarly, if Example 14.27 is true then
Examples 14.22, 14.23 and 14.26 must also be true. Such logical
connections cannot be made on the basis of these predicates alone. Moreover, we
would expect a commonsense knowledge base to contain logical connections
between concepts like Eating and related concepts like Hunger and Food.

One method to solve these problems involves the use of what are called
meaning postulates. Consider the following example postulate.

∀w,x,y,z Eating₇(w, x, y, z) ⇒ Eating₆(w, x, y)

This  postulate  explicitly  ties  together  the  semantics  of  two  of  our  predicates. 
Other  postulates  could  be  created  to  handle  the  rest  of  the  logical  relations 
among  the  various  Eatings  and  the  connections  from  them  to  other  related 
concepts. 

Although such an approach might be made to work in small domains,
it clearly has scalability problems. A somewhat more sensible approach is to
say that Examples 14.22 through 14.28 all reference the same predicate with
some of the arguments missing from some of the surface forms. Under this
approach, as many arguments are included in the definition of the predicate
as ever appear with it in an input. Adopting the structure of a predicate
like Eating₇ as an example would give us a predicate with four arguments
denoting the eater, thing eaten, meal being eaten and the location of the
eating. The following formulas would then capture the semantics of our
examples.

∃w,x,y Eating(Speaker, w, x, y)
∃w,x Eating(Speaker, TurkeySandwich, w, x)
∃w Eating(Speaker, TurkeySandwich, w, Desk)
∃w,x Eating(Speaker, w, x, Desk)
∃w,x Eating(Speaker, w, Lunch, x)
∃w Eating(Speaker, TurkeySandwich, Lunch, w)
Eating(Speaker, TurkeySandwich, Lunch, Desk)

This  approach  directly  yields  the  obvious  logical  connections  among 
these  formulas  without  the  use  of  meaning  postulates.  Specifically,  all  of  the 
sentences  with  ground  terms  as  arguments  logically  imply  the  truth  of  the 
formulas  with  existentially  bound  variables  as  arguments. 
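The entailment from a ground formula to its existentially quantified variants can be checked mechanically. The sketch below is an assumption-laden illustration: each four-place Eating formula is encoded as a tuple (eater, thing eaten, meal, location), and `None` stands for an existentially bound variable that occurs only once. The function name `entails` and the dictionary keys are ours.

```python
def entails(ground_fact, pattern):
    """True if the ground fact is an instance of the pattern.

    A None in the pattern stands for an existentially quantified variable
    that occurs only once, so it matches any filler in that position.
    """
    return len(ground_fact) == len(pattern) and all(
        p is None or p == g for g, p in zip(ground_fact, pattern)
    )


# Eating(Speaker, TurkeySandwich, Lunch, Desk), i.e. Example 14.28.
full = ("Speaker", "TurkeySandwich", "Lunch", "Desk")

# The existentially quantified formulas for Examples 14.22 through 14.27.
patterns = {
    "14.22": ("Speaker", None, None, None),
    "14.23": ("Speaker", "TurkeySandwich", None, None),
    "14.24": ("Speaker", "TurkeySandwich", None, "Desk"),
    "14.25": ("Speaker", None, None, "Desk"),
    "14.26": ("Speaker", None, "Lunch", None),
    "14.27": ("Speaker", "TurkeySandwich", "Lunch", None),
}

# Examples whose formulas follow from the fully specified one.
implied = [example for example, p in patterns.items() if entails(full, p)]
```

Every pattern is matched by the fully ground formula, so no meaning postulates are needed to relate the examples.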

Unfortunately,  this  approach  still  has  at  least  two  glaring  deficiencies: 
it  makes  too  many  commitments,  and  it  does  not  let  us  individuate  events. 
As  an  example  of  how  it  makes  too  many  commitments,  consider  how  we 
accommodated  the  for  lunch  complement  in  Examples  14.26  through  14.28; 
a third  argument,  the  meal  being  eaten,  was  added  to  the  Eating  predicate. 
The  presence  of  this  argument  implicitly  makes  it  the  case  that  all  eating 
events are associated with a meal (i.e., breakfast, lunch, or dinner). More
specifically,  the  existentially  quantified  variable  for  the  meal  argument  in  the 
above  examples  states  that  there  is  some  formal  meal  associated  with  each 
of  these  eatings.  This  is  clearly  silly  since  one  can  certainly  eat  something 
independent  of  it  being  associated  with  a meal. 

To  see  how  this  approach  fails  to  properly  individuate  events,  consider 
the  following  formulas. 

∃w,x Eating(Speaker, w, x, Desk)

∃w,x Eating(Speaker, w, Lunch, x)

∃w Eating(Speaker, w, Lunch, Desk)

If we knew that the first two formulas were referring to the same event, they
could  be  combined  to  create  the  third  representation.  Unfortunately,  with 
the  current  representation  we  have  no  way  of  telling  if  this  is  possible.  The 
independent  facts  that  I ate  at  my  desk  and  I ate  lunch  do  not  permit  us  to 
conclude  that  I ate  lunch  at  my  desk.  Clearly  what  is  lacking  is  some  way  of 
referring  to  the  events  in  question. 

As with categories, we can solve these problems if we employ reification
to elevate events to objects that can be quantified and related to other
objects via sets of defined relations (Davidson, 1967; Parsons, 1990).
Consider the representation of Example 14.23 under this kind of approach.

∃w ISA(w, Eating)
   ∧ Eater(w, Speaker) ∧ Eaten(w, TurkeySandwich)

This representation states that there is an eating event where the Speaker
is doing the eating and a TurkeySandwich is being eaten. The meaning
representations for Examples 14.22 and 14.27 can be constructed similarly.

∃w ISA(w, Eating) ∧ Eater(w, Speaker)

∃w ISA(w, Eating)
   ∧ Eater(w, Speaker) ∧ Eaten(w, TurkeySandwich)
   ∧ MealEaten(w, Lunch)

Under  this  reified-event  approach: 

• There is no need to specify a fixed number of arguments for a given
surface predicate; rather, as many roles and fillers can be glued on as
appear in the input.

• No more roles are postulated than are mentioned in the input.

• The logical connections among closely related examples are satisfied
without the need for meaning postulates.
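The reified-event representation can be sketched as a set of facts about an event constant. This is a minimal illustration, assuming that each event is given a fresh identifier and that roles are stored as (role, event, filler) triples; the helper names `new_event` and `describes` are hypothetical.

```python
import itertools

_event_ids = itertools.count(1)


def new_event(category, **roles):
    """Create a fresh event constant and the set of facts describing it."""
    event = f"e{next(_event_ids)}"
    facts = {("ISA", event, category)}
    facts.update((role, event, filler) for role, filler in roles.items())
    return event, facts


def describes(facts, event, category, **roles):
    """True if the facts about `event` include the category and every given role."""
    wanted = {("ISA", event, category)}
    wanted.update((role, event, filler) for role, filler in roles.items())
    return wanted <= facts


# (14.27) I ate a turkey sandwich for lunch.
e, kb = new_event("Eating", Eater="Speaker",
                  Eaten="TurkeySandwich", MealEaten="Lunch")

# A second, separately reported eating event gets its own constant.
e2, kb2 = new_event("Eating", Eater="Speaker")
```

The less specific descriptions (14.22 and 14.23) are simply subsets of the facts about `e`, and because `e` and `e2` are distinct constants, facts about one event are never silently merged with facts about another.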

Representing  Time 

In the preceding discussion of events, we did not address the issue of
representing the time when the represented events are supposed to have occurred.
The representation of such information in a useful form is the domain of
temporal logic. This discussion will serve to introduce the most basic
concerns of temporal logic along with a brief discussion of the means by which
human languages convey temporal information, which among other things
includes tense logic, the ways that verb tenses convey temporal information.

The most straightforward theory of time holds that it flows inexorably
forward, and that events are associated with either points or intervals in time,
as on a timeline. Given these notions, an ordering can be imposed on distinct
events by situating them on the timeline. More specifically, we can say that
one event precedes another if the flow of time leads from the first event to
the second. Accompanying these notions in most theories is the idea of the
current moment in time. Combining this notion with the idea of a temporal
ordering relationship yields the familiar notions of past, present and future.

Not surprisingly, there are a large number of schemes for representing
this kind of temporal information. The one presented here is a fairly simple
one that stays within the FOPC framework of reified events that we have been
pursuing. Consider the following examples.

(14.29)  I arrived  in  New  York. 

(14.30)  I am  arriving  in  New  York. 

(14.31)  I will  arrive  in  New  York. 

These  sentences  all  refer  to  the  same  kind  of  event  and  differ  solely  in  the 
tense  of  the  verb.  In  our  current  scheme  for  representing  events,  all  three 
would  share  the  following  kind  of  representation,  which  lacks  any  temporal 
information. 

∃w ISA(w, Arriving)
   ∧ Arriver(w, Speaker) ∧ Destination(w, NewYork)

The temporal information provided by the tense of the verbs can be
exploited by predicating additional information about the event variable w.
Specifically, we can add temporal variables representing the interval
corresponding to the event, the end point of the event, and temporal predicates
relating this end point to the current time as indicated by the tense of the
verb. Such an approach yields the following representations for our arriving
examples.

∃i,e,w,t ISA(w, Arriving)
   ∧ Arriver(w, Speaker) ∧ Destination(w, NewYork)
   ∧ IntervalOf(w, i) ∧ EndPoint(i, e) ∧ Precedes(e, Now)

∃i,e,w,t ISA(w, Arriving)
   ∧ Arriver(w, Speaker) ∧ Destination(w, NewYork)
   ∧ IntervalOf(w, i) ∧ MemberOf(i, Now)

∃i,e,w,t ISA(w, Arriving)
   ∧ Arriver(w, Speaker) ∧ Destination(w, NewYork)
   ∧ IntervalOf(w, i) ∧ EndPoint(i, e) ∧ Precedes(Now, e)

This representation introduces a variable to stand for the interval of time
associated with the event, and a variable that stands for the end of that interval.
The two-place predicate Precedes represents the notion that the first time
point argument precedes the second in time; the constant Now refers to the
current time. For past events, the end point of the interval must precede the
current time. Similarly, for future events the current time must precede the
end of the event. For events happening in the present, the current time is
contained within the event interval.
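These three constraints can be written down directly. The following sketch assumes that time points are plain numbers and that an interval is a pair of start and end points; the class `Interval` and the function `satisfies` are hypothetical names, while `precedes` and `member_of` mirror the Precedes and MemberOf predicates used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    start: float
    end: float   # plays the role of e in EndPoint(i, e)


def precedes(t1, t2):
    """Precedes(t1, t2): the first time point comes before the second."""
    return t1 < t2


def member_of(interval, t):
    """MemberOf(i, t): the time point t lies within the interval."""
    return interval.start <= t <= interval.end


def satisfies(tense, interval, now):
    """Check the temporal constraint attached to each tense in the text."""
    if tense == "past":
        return precedes(interval.end, now)
    if tense == "present":
        return member_of(interval, now)
    if tense == "future":
        # As stated in the text, only the end point is compared with Now.
        return precedes(now, interval.end)
    raise ValueError(f"unknown tense: {tense}")


NOW = 10
arrived = satisfies("past", Interval(1, 2), NOW)
arriving = satisfies("present", Interval(9, 11), NOW)
will_arrive = satisfies("future", Interval(12, 13), NOW)
```

Each call tests one constraint in isolation; the sketch does not try to classify an event into exactly one tense.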

Unfortunately, the relation between simple verb tenses and points in
time is by no means straightforward. Consider the following examples.

(14.32)  Ok,  we  fly  from  San  Francisco  to  Boston  at  10. 

(14.33)  Flight  1390  will  be  at  the  gate  an  hour  now. 

In  the  first  example,  the  present  tense  of  the  verb  fly  is  used  to  refer  to  a 
future  event,  while  in  the  second  the  future  tense  is  used  to  refer  to  a past 
event. 

More  complications  occur  when  we  consider  some  of  the  other  verb 
tenses.  Consider  the  following  examples. 

(14.34)  Flight  1902  arrived  late. 

(14.35)  Flight  1902  had  arrived  late. 

Although  both  refer  to  events  in  the  past,  representing  them  in  the  same  way 
seems  wrong.  The  second  example  seems  to  have  another  unnamed  event 
lurking in the background (e.g., Flight 1902 had already arrived late when
something else happened). To account for this phenomenon, Reichenbach
(1947)  introduced  the  notion  of  a reference  point.  In  our  simple  temporal 
scheme,  the  current  moment  in  time  is  equated  with  the  time  of  the  utterance, 
and  is  used  as  a reference  point  for  when  the  event  occurred  (before,  at, 
or  after).  In  Reichenbach’s  approach,  the  notion  of  the  reference  point  is 
separated  out  from  the  utterance  time  and  the  event  time.  The  following 
examples  illustrate  the  basics  of  this  approach. 

(14.36) When Mary’s flight departed, I ate lunch.

(14.37)  When  Mary’s  flight  departed,  I had  eaten  lunch. 

In both of these examples, the eating event has happened in the past,
i.e., prior to the utterance. However, the verb tense in the first example
indicates that the eating event began when the flight departed, while the second
example indicates that the eating was accomplished prior to the flight’s
departure. Therefore, in Reichenbach’s terms the departure event specifies the
reference point. These facts can be accommodated by asserting additional
constraints relating the eating and departure events. In the first example, the
reference point precedes the eating event, and in the second example, the
eating precedes the reference point. Figure 14.4 illustrates Reichenbach’s
approach with the primary English tenses. Exercise 14.9 asks you to
represent these examples in FOPC.
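The tense diagrams of Figure 14.4 can be encoded as orderings over the three points E (event), R (reference), and U (utterance). The table below is a sketch: the groupings are read from the figure, points in the same group are taken to coincide, and the future perfect row follows the usual Reichenbach analysis and should be treated as an assumption. The names `TENSES`, `position`, and `before` are ours.

```python
# Each tense is a list of groups ordered in time (left to right);
# points within the same group coincide.
TENSES = {
    "past perfect":    [{"E"}, {"R"}, {"U"}],
    "simple past":     [{"R", "E"}, {"U"}],
    "present perfect": [{"E"}, {"R", "U"}],
    "present":         [{"U", "R", "E"}],
    "simple future":   [{"U", "R"}, {"E"}],
    "future perfect":  [{"U"}, {"E"}, {"R"}],   # assumed ordering
}


def position(tense, point):
    """Index of the group that contains the given point (E, R, or U)."""
    for index, group in enumerate(TENSES[tense]):
        if point in group:
            return index
    raise KeyError(point)


def before(tense, a, b):
    """True if point a strictly precedes point b under the given tense."""
    return position(tense, a) < position(tense, b)


# Past perfect ("I had eaten"): the event precedes the reference point.
past_perfect_e_before_r = before("past perfect", "E", "R")
```

Separating R from U is what distinguishes the past perfect from the simple past here: in both, E precedes U, but only in the past perfect does E also precede R.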

Past Perfect       Simple Past       Present Perfect
I had eaten.       I ate.            I have eaten.
E   R   U          R,E   U           E   R,U

Present            Simple Future     Future Perfect
I eat.             I will eat.       I will have eaten.
U,R,E              U,R   E           U   E   R

Figure 14.4 Reichenbach’s approach applied to various English tenses. In
these diagrams, time flows from left to right, an E denotes the time of the event,
an R denotes the reference time, and a U denotes the time of the utterance.

This discussion has focused narrowly on the broad notions of past,
present, and future and how they are signaled by verb tenses. Of course,
languages also have many other more direct and more specific ways to
convey temporal information, including the use of a wide variety of temporal
expressions as in the following ATIS examples.

(14.38)  I'd  like  to  go  at  6:45,  in  the  morning. 

(14.39)  Somewhere  around  noon,  please. 

(14.40)  Later  in  the  afternoon,  near  6pm. 

As we will see in the next chapter, grammars for such temporal expressions
are of considerable practical importance in information extraction and
question answering applications.

Finally, we should note that there is a systematic conceptual organization
reflected in examples like these. In particular, temporal expressions in
English are frequently expressed in spatial terms, as is illustrated by the
various uses of at, in, somewhere and near in these examples (Lakoff and
Johnson, 1980; Jackendoff, 1983a). Metaphorical organizations such as these,
where one domain is systematically expressed in terms of another, will be
discussed in more detail in Chapter 16.

Aspect 

In the last section, we discussed ways to represent the time of an event with
respect to the time of an utterance describing it. In this section, we address
the notion of aspect, which concerns a cluster of related topics, including
whether an event has ended or is ongoing, whether it is conceptualized as
happening at a point in time or over some interval, and whether or not any
particular state in the world comes about because of it. Based on these and
related notions, event expressions have traditionally been divided into four
general classes: statives, activities, accomplishments, and achievements.

The  following  examples  provide  prototypical  instances  of  each  class. 

Stative:  I know  my  departure  gate. 

Activity:  John  is  flying. 

Accomplishment:  Sally  booked  her  flight. 

Achievement:  She  found  her  gate. 

Although the earliest versions of this classification were discussed by Aristotle, the one presented here is due to Vendler (1967). In the following discussion, we'll present a brief characterization of each of the four classes, along with some diagnostic techniques suggested in Dowty (1979) for identifying examples of each kind.

Stative expressions represent the notion of an event participant having a particular property, or being in a state, at a given point in time. As such, they can be thought of as capturing an aspect of a world at a single point in time. Consider the following ATIS examples.

(14.41)  I like  Flight  840  arriving  at  10:06. 

(14.42)  I need  the  cheapest  fare. 

(14.43)  I have  a round  trip  ticket  for  $662. 

(14.44)  I want  to  go  first  class. 

In  examples  like  these,  the  event  participant  denoted  by  the  subject  can  be 
seen  as  experiencing  something  at  a specific  point  in  time.  Whether  or  not 
the  experiencer  was  in  the  same  state  earlier,  or  will  be  in  the  future  is  left 
unspecified. 

There are a number of diagnostic tests for identifying statives. As an example, stative verbs are distinctly odd when used in the progressive form.

(14.45) *I am needing the cheapest fare on this day.

(14.46) *I am wanting to go first class.

We should note that in these and subsequent examples, we are using an * to indicate a broadened notion of ill-formedness that may include both semantic and syntactic factors.

Statives are also odd when used as imperatives.

(14.47) *Need the cheapest fare!

Finally, statives are not easily modified by adverbs like deliberately and carefully.

(14.48) *I deliberately like Flight 840 arriving at 10:06.

(14.49) *I carefully like Flight 840 arriving at 10:06.

Activity expressions describe events undertaken by a participant that have no particular end-point. Unlike statives, activities are seen as occurring over some span of time, and are therefore not associated with single points in time. Consider the following examples.

(14.50)  She  drove  a Mazda. 

(14.51)  I live  in  Brooklyn. 

These  examples  both  specify  that  the  subject  is  engaged  in,  or  has  engaged 
in,  the  activity  specified  by  the  verb  for  some  period  of  time. 

Unlike statives, activity expressions are fine in both the progressive and imperative forms.

(14.52)  She  is  living  in  Brooklyn. 

(14.53)  Drive  a Mazda! 

However, like statives, activity expressions are odd when modified with temporal expressions using in.

(14.54) *I live in Brooklyn in a month.

(14.55) *She drove a Mazda in an hour.

They  can,  however,  successfully  be  used  with  for  temporal  adverbials,  as  in 
the  following  examples. 

(14.56)  I live  in  Brooklyn  for  a month. 

(14.57)  She  drove  a Mazda  for  an  hour. 

Unlike activities, accomplishment expressions describe events that have a natural end-point and result in a particular state. Consider the following examples.

(14.58)  He  booked  me  a reservation. 

(14.59)  United  flew  me  to  New  York. 

In  these  examples,  there  is  an  event  that  is  seen  as  occurring  over  some  period 
of  time  that  ends  when  the  intended  state  is  accomplished. 

A number  of  diagnostics  can  be  used  to  distinguish  accomplishment 
events  from  activities.  Consider  the  following  examples,  which  make  use  of 
the  word  stop  as  a test. 

(14.60)  I stopped  living  in  Brooklyn. 

(14.61) She stopped booking my flight.

In the first example, which is an activity, one can safely conclude the statement I lived in Brooklyn, even though this activity came to an end. However, from the second example, one can not conclude the statement She booked her flight, since the activity was stopped before the intended state was accomplished. Therefore, although stopping an activity entails that the activity took place, stopping an accomplishment event indicates that the event did not succeed.

Activities and accomplishments can also be distinguished by how they can be modified by various temporal adverbials. Consider the following examples.

(14.62) *I lived in Brooklyn in a year.

(14.63)  She  booked  a flight  in  a minute. 

In  general,  accomplishments  can  be  modified  by  in  temporal  expressions, 
while  simple  activities  can  not. 

The final aspectual class, achievements, is similar to accomplishments in that its members result in a state. Consider the following examples.

(14.64)  She  found  her  gate. 

(14.65)  I reached  New  York. 

Unlike accomplishments, achievement events are thought of as happening in an instant, and are not equated with any particular activity leading up to the state. To be more specific, the events in these examples may have been preceded by extended searching or traveling events, but the events corresponding directly to found and reach are conceived of as points, not intervals.

The  point-like  nature  of  these  events  has  implications  for  how  they  can 
be  temporally  modified.  In  particular,  consider  the  following  examples. 

(14.66)  I lived  in  New  York  for  a year. 

(14.67) *I reached New York for a few minutes.

Unlike  activity  and  accomplishment  expressions,  achievements  can  not  be 
modified  by  for  adverbials. 

Achievements can also be distinguished from accomplishments by employing the word stop, as we did earlier. Consider the following examples.

(14.68)  I stopped  booking  my  flight. 

(14.69) *I stopped reaching New York.

As we saw earlier, using stop with an accomplishment expression results in a failure to reach the intended state. Note, however, that the resulting expression is perfectly well-formed. On the other hand, using stop with an achievement example is unacceptable.

We should note that since both accomplishments and achievements are events that result in a state, they are sometimes characterized as sub-types of a single aspectual class. Members of this combined class are known as telic eventualities.

Before  moving  on,  we  should  make  two  points  about  this  classification 
scheme.  The  first  point  is  that  event  expressions  can  easily  be  shifted  from 
one  class  to  another.  Consider  the  following  examples. 

(14.70)  I flew. 

(14.71)  I flew  to  New  York. 

The first example is a simple activity; it has no natural end-point and can not be temporally modified by in temporal expressions. On the other hand, the second example is clearly an accomplishment event since it has an end-point, results in a particular state, and can be temporally modified in all the ways that accomplishments can. Clearly the classification of an event is not solely governed by the verb, but by the semantics of the entire expression in context.

The second point is that while classifications such as this one are often useful, they do not explain why it is that events expressed in natural languages fall into these particular classes. We will revisit this issue in Chapter 16 where we will sketch a representational approach due to Dowty (1979) that accounts for these classes.
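The diagnostics discussed in this section can be summarized as a small decision procedure over acceptability judgments. The following is a minimal sketch, not a definitive classifier: the field names, the order in which the tests are applied, and the judgments the text does not state (for example, whether statives accept for adverbials) are assumptions made for illustration.

```python
from typing import NamedTuple


class Diagnostics(NamedTuple):
    """Acceptability judgments supplied by a human annotator (hypothetical fields)."""
    progressive_ok: bool    # e.g. "She is living in Brooklyn."
    in_adverbial_ok: bool   # e.g. "She booked a flight in a minute."
    for_adverbial_ok: bool  # e.g. "She drove a Mazda for an hour."
    stop_ok: bool           # e.g. "I stopped booking my flight."


def classify(d: Diagnostics) -> str:
    """Map diagnostic judgments to one of the four aspectual classes."""
    # Achievements reject both "for" adverbials and "stop".
    if not d.for_adverbial_ok and not d.stop_ok:
        return "achievement"
    # Statives are odd in the progressive.
    if not d.progressive_ok:
        return "stative"
    # Accomplishments accept "in" adverbials; simple activities do not.
    if d.in_adverbial_ok:
        return "accomplishment"
    return "activity"


# Judgments partly taken from the examples above, partly assumed.
print(classify(Diagnostics(True, False, True, True)))   # live in Brooklyn -> activity
print(classify(Diagnostics(True, True, True, True)))    # book a flight -> accomplishment
print(classify(Diagnostics(True, True, False, False)))  # reach New York -> achievement
print(classify(Diagnostics(False, False, True, True)))  # need the cheapest fare -> stative
```

Because the class of an expression depends on the whole expression in context, the judgments would have to be collected for full expressions such as I flew to New York rather than for bare verbs.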

Representing  Beliefs 

There are a fair number of words and expressions that have what might be called a world creating ability. By this, we mean that their meaning representations contain logical formulas that are not intended to be taken as true in the real world, but rather as part of some kind of hypothetical world. In addition, these meaning representations often denote a relation from the speaker, or some other entity, to this hypothetical world. Examples of words that have this ability are believe, want, imagine and know. World-creating words generally take various sentence-like constituents as arguments.

Consider  the  following  example. 

(14.72)  I believe  that  Mary  ate  British  food. 

Applying our event-oriented approach, we would say that there are two events underlying this sentence: a believing event relating the speaker to some specific belief, and an eating event that plays the role of the believed thing. Ignoring temporal information, a straightforward application of our reified event approach would produce the following kind of representation.

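$$\begin{aligned}
\exists u, v\; & \mathit{ISA}(u, \mathit{Believing}) \land \mathit{ISA}(v, \mathit{Eating}) \\
& \land \mathit{Believer}(u, \mathit{Speaker}) \land \mathit{BelievedProp}(u, v) \\
& \land \mathit{Eater}(v, \mathit{Mary}) \land \mathit{Eaten}(v, \mathit{BritishFood})
\end{aligned}$$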

This seems relatively straightforward: all the right roles are present and the two events are tied together in a reasonable way. Recall, however, that in conjunctive representations like this all of the individual conjuncts must be taken to be true. In this case, this results in a statement that there actually was an eating of British food by Mary. Specifically, by breaking this formula apart into separate formulas by conjunction elimination, the following formula can be produced.

$$\exists v\; \mathit{ISA}(v, \mathit{Eating}) \land \mathit{Eater}(v, \mathit{Mary}) \land \mathit{Eaten}(v, \mathit{BritishFood})$$

This is clearly more than we want to say. The fact that the speaker believes this proposition does not make it true; it is only true in the world represented by the speaker’s beliefs. What is needed is a representation that has a structure similar to this, but where the Eating event is given a special status.

Note that reverting to the simpler predicate representations we used earlier in this chapter does not help. A common mistake using such representations would be to represent this sentence with the following kind of formula.

$$\mathit{Believing}(\mathit{Speaker}, \mathit{Eating}(\mathit{Mary}, \mathit{BritishFood}))$$

The  problem  with  this  representation  is  that  it  is  not  even  valid  FOPC.  The 
second  argument  to  the  Believing  predicate  should  be  a FOPC  term,  not  a 
formula.  This  syntactic  error  reflects  a deeper  semantic  problem.  Predicates 
in  FOPC  hold  between  the  objects  in  the  domain  being  modeled,  not  between 
the  relations  that  hold  among  the  objects  in  the  domain.  Therefore,  FOPC 
lacks  a meaningful  way  to  assert  relations  about  full  propositions,  which  is 
unfortunately  exactly  what  words  like  believe,  want,  imagine  and  know  want 
to  do. 

The standard method for handling this situation is to augment FOPC with operators that allow us to make statements about full logical formulas. Let’s consider how this approach might work in the case of Example 14.72. We can introduce an operator called Believes that takes two FOPC formulas as its arguments: a formula designating a believer, and a formula designating the believed proposition. Applying this operator would result in the following meaning representation.

$$\mathit{Believes}(\mathit{Speaker},\; \exists v\; \mathit{ISA}(v, \mathit{Eating}) \land \mathit{Eater}(v, \mathit{Mary}) \land \mathit{Eaten}(v, \mathit{BritishFood}))$$

Under this approach, the contribution of the word believes to this meaning representation is not a FOPC proposition at all, but rather an operator that is applied to the believed proposition. Therefore, as we discuss in Chapter 15, these world creating verbs play quite a different role in the semantic analysis than more ordinary verbs like eat.
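The contrast between the flat conjunctive representation and the operator-based one can be sketched in code. In the following illustration, which is only a sketch under stated assumptions, quantifiers are omitted and the variables u and v are treated as plain constants; the class names `Atom`, `And`, and `Believes` and the function `asserted_atoms` are hypothetical. Conjunction elimination is allowed to reach inside a conjunction but never inside the Believes operator.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    predicate: str
    args: Tuple[str, ...]


@dataclass(frozen=True)
class And:
    conjuncts: Tuple["Formula", ...]


@dataclass(frozen=True)
class Believes:
    believer: str
    proposition: "Formula"


Formula = Union[Atom, And, Believes]


def asserted_atoms(formula):
    """Collect the atoms obtainable by conjunction elimination.

    The body of a Believes operator is never opened, so nothing inside it
    is asserted to hold in the real world.
    """
    if isinstance(formula, Atom):
        return {formula}
    if isinstance(formula, And):
        atoms = set()
        for conjunct in formula.conjuncts:
            atoms |= asserted_atoms(conjunct)
        return atoms
    return set()  # Believes: treated as opaque


eating = And((
    Atom("ISA", ("v", "Eating")),
    Atom("Eater", ("v", "Mary")),
    Atom("Eaten", ("v", "BritishFood")),
))

# Flat reified-event representation: the eating event is asserted outright.
flat = And((
    Atom("ISA", ("u", "Believing")),
    Atom("Believer", ("u", "Speaker")),
    Atom("BelievedProp", ("u", "v")),
    eating,
))

# Operator-based representation: the eating event lives inside Believes.
modal = Believes("Speaker", eating)

print(Atom("Eater", ("v", "Mary")) in asserted_atoms(flat))   # True
print(Atom("Eater", ("v", "Mary")) in asserted_atoms(modal))  # False
```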

As  one  might  expect,  keeping  track  of  who  believes  what  about  whom 
at  any  given  point  in  time  gets  rather  complex.  As  we  will  see  in  Chapter  18, 
this  is  an  important  task  in  interactive  systems  that  must  track  users’  beliefs 
as  they  change  during  the  course  of  a dialog. 

Operators like Believes that apply to logical formulas are known as modal operators. Correspondingly, a logic augmented with such operators is known as a modal logic. Modal logics have found many uses in the representation of commonsense knowledge in addition to the modeling of belief; among the more prominent are representations of time and hypothetical worlds.

Not surprisingly, modal operators and modal logics raise a host of complex theoretical and practical problems that we can not even begin to do justice to here. Among the more important issues are the following.

• How  inference  works  in  the  presence  of  specific  modal  operators. 

• The  kinds  of  logical  formula  that  particular  operators  can  be  applied 
to. 

• How  modal  operators  interact  with  quantifiers  and  logical  connectives. 

• The influence of these operators on the equality of terms across formulas.

The last issue in this list has consequences for modeling agents’ knowledge and beliefs in dialog systems and deserves some elaboration here. In standard FOPC systems, logical terms that are known to be equal to one another can be freely substituted without having any effect on the truth of sentences they occur in. Consider the following examples.

(14.73)  Snow  has  delayed  Flight  1045. 

(14.74)  John's  sister’s  flight  serves  dinner. 

Assuming that these two flights are the same, substituting Flight 1045 for John’s sister’s flight has no effect on the truth of either sentence.

Now consider the following variations on these examples.

(14.75)  John  knows  that  snow  has  delayed  Flight  1045. 

(14.76)  John  knows  that  his  sister’s  flight  serves  dinner. 

Here the substitution does not work. John may well know that Flight 1045 has been delayed without knowing that his sister’s flight is delayed, simply because he may not know the number of his sister’s flight. In other words, even if we assume that these sentences are true, and that John’s sister is on Flight 1045, we can not say anything about the truth of the following sentence.


(14.77)  John  knows  that  snow  has  delayed  his  sister’s  flight. 


Settings like this where a modal operator like Know is involved are called referentially opaque. In referentially opaque settings, substitution of equal terms may or may not succeed. Ordinary settings where such substitutions always work are said to be referentially transparent.
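A substitution routine that respects this distinction might look like the following sketch. The tuple encoding of formulas, the set `MODAL_OPERATORS`, and the constant names are assumptions made for illustration. Since substitution in opaque settings may or may not succeed, the sketch takes the conservative route and never substitutes inside a modal operator.

```python
# Terms are strings; formulas are nested tuples: (predicate, arg1, arg2, ...).
# Any formula headed by one of these names is treated as referentially opaque.
MODAL_OPERATORS = {"Knows", "Believes"}


def substitute(formula, old, new):
    """Replace the term `old` with the equal term `new` in transparent settings only."""
    if isinstance(formula, str):
        return new if formula == old else formula
    head, *args = formula
    if head in MODAL_OPERATORS:
        return formula  # opaque: leave the whole modal formula untouched
    return (head, *(substitute(arg, old, new) for arg in args))


delayed = ("Delayed", "Snow", "Flight1045")
john_knows = ("Knows", "John", delayed)

# Transparent setting: the substitution goes through.
print(substitute(delayed, "Flight1045", "SistersFlight"))
# ('Delayed', 'Snow', 'SistersFlight')

# Opaque setting: the formula is returned unchanged.
print(substitute(john_knows, "Flight1045", "SistersFlight"))
# ('Knows', 'John', ('Delayed', 'Snow', 'Flight1045'))
```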




Pitfalls 


As noted in Section 14.3, there are a number of common mistakes in representing the meaning of natural language utterances that arise from confusing, or equating, elements from real languages with elements in FOPC. Consider the following example, which on the surface looks like a standard implication rule.

(14.78) If you’re interested in baseball, the Rockies are playing tonight.

A straightforward translation of this sentence into FOPC might look something like this.

$$\mathit{HaveInterestIn}(\mathit{Hearer}, \mathit{Baseball}) \Rightarrow \mathit{Playing}(\mathit{Rockies}, \mathit{Tonight})$$

This representation is flawed for a large number of reasons. The most obvious ones arise from the semantics of FOPC implications. In the event that the hearer is not interested in baseball, this formula becomes meaningless. Specifically, we can not draw any conclusion about the consequent clause when the antecedent is false. But of course this is a ridiculous conclusion; we know that the Rockies game will go forward regardless of whether or not the hearer happens to like baseball. Exercise 14.10 asks you to come up with a more reasonable FOPC translation of this example.
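The behavior of the implication can be checked directly from its truth table. The short sketch below (the names `implies` and `compatible` are illustrative only) enumerates all four assignments and then collects the values the consequent may take when the implication holds and the antecedent is false.

```python
from itertools import product


def implies(p, q):
    """Material implication as used in FOPC: false only when p is true and q is false."""
    return (not p) or q


rows = [(p, q, implies(p, q)) for p, q in product([True, False], repeat=2)]
for p, q, value in rows:
    print(f"antecedent={p!s:5}  consequent={q!s:5}  implication={value}")

# Consequent values compatible with a true implication and a false antecedent.
compatible = {q for p, q, value in rows if value and not p}
print(compatible == {True, False})  # True
```

Both truth values remain possible for the consequent, which is why the formula tells us nothing about the Rockies game when the hearer has no interest in baseball.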

Now  consider  the  following  example. 

(14.79) One more beer and I'll fall off this stool.

Again, a simple-minded translation of this sentence might consist of a conjunction of two clauses: one representing a drinking event and one representing a falling event. In this case, the surface use of the word and obscures the fact that this sentence instead has an implication underlying it. The lesson of both of these examples is that English words like and, or and if are only tenuously related to the elements of FOPC with the same names.

Along  the  same  lines,  it  is  important  to  remember  the  complete  lack 
of  significance  of  the  names  we  make  use  of  in  representing  FOPC  formulas. 
Consider  the  following  constant. 


InexpensiveVegetarianIndianFoodOnTuesdays


Despite its impressive morphology, this term, by itself, has no more meaning than a constant like X99 would have. See McDermott (1976) for a discourse on the inherent dangers of such naming schemes.

Over the years, a fair number of representational schemes have been invented to capture the meaning of linguistic utterances for use in natural language processing systems. Other than logic, two of the most widely used schemes have been Semantic Networks and Frames, which are also known as slot-filler representations. The KL-ONE (Brachman and Schmolze, 1985a) and KRL (Bobrow and Winograd, 1977) systems represent influential efforts to represent knowledge for use in natural language processing systems.

In semantic networks, objects are represented as nodes in a graph, with relations between objects being represented by named links. In frame-based systems, objects are represented as feature-structures similar to those discussed in Chapter 11, which can, of course, also be naturally represented as graphs. In this approach features are called slots and the values, or fillers, of these slots can either be atomic values or other embedded frames. The following diagram illustrates how Example 14.72 might be captured in a frame-based approach.


I believe Mary ate British food.

BELIEVING
  BELIEVER  Speaker
  BELIEVED  EATING
              EATER  Mary
              EATEN  BritishFood

It  is  now  widely  accepted  that  meanings  represented  in  these  approaches  can 
be  translated  into  equivalent  statements  in  FOPC  with  relative  ease. 
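The translation from frames to logic can be sketched mechanically. In the illustration below, a frame is a nested dictionary whose "ISA" key gives its type; the function name `frame_to_fopc`, the dictionary layout, and the generated variable names e1, e2 are all assumptions made for this sketch. Each frame becomes a reified event variable, and each slot becomes a two-place predicate.

```python
from itertools import count


def frame_to_fopc(frame, counter=None, conjuncts=None):
    """Return (variable, conjuncts) for a frame given as {"ISA": type, slot: filler}."""
    if counter is None:
        counter = count(1)
    if conjuncts is None:
        conjuncts = []
    var = f"e{next(counter)}"
    conjuncts.append(f"ISA({var}, {frame['ISA']})")
    for slot, filler in frame.items():
        if slot == "ISA":
            continue
        if isinstance(filler, dict):
            # An embedded frame gets its own variable, which fills the slot.
            filler_var, _ = frame_to_fopc(filler, counter, conjuncts)
            conjuncts.append(f"{slot}({var}, {filler_var})")
        else:
            conjuncts.append(f"{slot}({var}, {filler})")
    return var, conjuncts


belief_frame = {
    "ISA": "Believing",
    "Believer": "Speaker",
    "Believed": {"ISA": "Eating", "Eater": "Mary", "Eaten": "BritishFood"},
}

top_var, belief_conjuncts = frame_to_fopc(belief_frame)
print(" AND ".join(belief_conjuncts))
# ISA(e1, Believing) AND Believer(e1, Speaker) AND ISA(e2, Eating) AND
# Eater(e2, Mary) AND Eaten(e2, BritishFood) AND Believed(e1, e2)
# (printed on a single line)
```

Note that this naive translation yields a flat conjunction, so for a belief frame it inherits the problem discussed under Representing Beliefs: the embedded eating event is asserted as if it were true in the real world.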

14.6  Alternative  Approaches  to  Meaning 

The notion that the translation of linguistic inputs into a formal representation made up of discrete symbols adequately captures the notion of meaning is, not surprisingly, subject to a considerable amount of debate. The following sections give brief, wholly inadequate, overviews of some of the major concerns in these debates.

Meaning  as  Action 

An approach that holds considerable appeal when we consider the semantics of imperative sentences is the notion of meaning as action. Under this view, utterances are viewed as actions, and the meanings of these utterances reside in procedures that are activated in the hearer as a result of hearing the utterance. This approach was followed in the creation of the historically important SHRDLU system, and is summed up well by its creator Terry Winograd (1972b).

One of the basic viewpoints underlying the model is that all language use can be thought of as a way of activating procedures within the hearer. We can think of an utterance as a program - one that indirectly causes a set of operations to be carried out within the hearer’s cognitive system.

A recent procedural model of semantics is the executing schema or x-schema model of Bailey et al. (1997), Narayanan (1997a, 1997b), and Chang et al. (1998). The intuition of this model is that various parts of the semantics of events, including the aspectual factors discussed on page 526, are based on schematized descriptions of sensory-motor processes like inception, iteration, enabling, completion, force, and effort. The model represents the aspectual semantics of events via a kind of probabilistic automaton called a Petri net (Murata, 1989). The nets used in the model have states like ready, process, finish, suspend, and result.

The  meaning  representation  of  an  example  like  Jack  is  walking  to  the 
store  activates  the  process  state  of  the  walking  event.  An  accomplishment 
event  like  Jack  walked  to  the  store  activates  the  result  state.  An  iterative 
activity  like  Jack  walked  to  the  store  every  week  is  simulated  in  the  model 
by  an  iterative  activation  of  the  process  and  result  nodes.  This  idea  of  using 
sensory-motor  primitives  as  a foundation  for  semantic  description  is  also 
based  on  the  work  of  Regier  (1996)  on  the  role  of  visual  primitives  in  a 
computational  model  of  learning  the  semantics  of  spatial  prepositions. 

Meaning  as  Truth 

The role of formal meaning representations in linguistics, natural language processing, artificial intelligence, and cognitive modeling, is quite different from its role in more philosophical circles. In the former approaches, the name of the game is getting from linguistic inputs to appropriate, unambiguous, and operationally useful representations.3

To philosophers, however, the mere translation of a sentence from its original natural form to another artificial form does not get us any closer to its meaning (Lewis, 1972). Formal representations may facilitate real semantic work, but are not by themselves of much interest. Under this view, the important work is in the functions, or procedures, that determine the mapping from these representations to the world being modeled. Of particular interest in these approaches are the functions that determine the truth conditions of sentences, or their formal representations.


Summary 

This  chapter  has  introduced  the  representational  approach  to  meaning.  The 
following  are  some  of  the  highlights  of  this  chapter. 

• A major approach to meaning in computational linguistics involves the creation of formal meaning representations that capture the meaning-related content of linguistic inputs. These representations are intended to bridge the gap from language to commonsense knowledge of the world.

3 Of course, what counts as useful varies considerably among these areas.

• The frameworks that specify the syntax and semantics of these representations are called meaning representation languages. A wide variety of such languages are used in natural language processing and artificial intelligence.

• Such representations need to be able to support the practical computational requirements of semantic processing. Among these are the need to determine the truth of propositions, to support unambiguous representation, to represent variables, to support inference, and to be expressive.

• Human languages have a wide variety of features that are used to convey meaning. Among the most important of these is the ability to convey a predicate-argument structure.

• FOPC is a well-understood, computationally tractable meaning representation language that offers much of what is needed in a meaning representation language.

• Important classes of meaning including categories, events, and time can be captured in FOPC. Propositions corresponding to such concepts as beliefs and desires require extensions to FOPC including modal operators.

• Semantic networks and frames can be captured within the FOPC framework.


Bibliographical  and  Historical  Notes 

The earliest computational use of declarative meaning representations in natural language processing was in the context of question-answering systems (Green et al., 1963; Raphael, 1968; Lindsey, 1963). These systems employed ad-hoc representations for the facts needed to answer questions. Questions were then translated into a form that could be matched against facts in the knowledge base. Simmons (1965) provides an overview of these early efforts.

Woods (1967) investigated the use of FOPC-like representations in question-answering as a replacement for the ad-hoc representations in use at the time. Woods (1973) further developed and extended these ideas in the landmark Lunar system. Interestingly, the representations used in Lunar had both a truth-conditional and a procedural semantics. Winograd (1972b) employed a similar representation based on the Micro-Planner language in his SHRDLU system.

During this same period, researchers interested in the cognitive modeling of language and memory had been working with various forms of associative network representations. Masterman (1957) was probably the first to make computational use of a semantic network-like knowledge representation, although semantic networks are generally credited to Quillian (1968). A considerable amount of work in the semantic network framework was carried out during this era (Norman and Rumelhart, 1975; Schank, 1972; Wilks, 1975c, 1975b; Kintsch, 1974). It was during this period that a number of researchers began to incorporate Fillmore’s notion of case roles (Fillmore, 1968) into their representations. Simmons (1973a) was the earliest adopter of case roles as part of representations for natural language processing.

Detailed analyses by Woods (1975) and Brachman and Schmolze (1985a) aimed at figuring out what semantic networks actually mean led to the development of a number of more sophisticated network-like languages including KRL (Bobrow and Winograd, 1977) and KL-ONE (Brachman and Schmolze, 1985a). As these frameworks became more sophisticated and well-defined it became clear that they were restricted variants of FOPC coupled with specialized inference procedures. A useful collection of papers covering much of this work can be found in (Brachman and Levesque, 1985). Russell and Norvig (1995) describe a modern perspective on these representational efforts.

Linguistic efforts to assign semantic structures to natural language sentences in the generative era began with the work of Katz and Fodor (1963). The limitations of their simple feature-based representations and the natural fit of logic to many of the linguistic problems of the day quickly led to the adoption of a variety of predicate-argument structures as preferred semantic representations (Lakoff, 1972; McCawley, 1968). The subsequent introduction by Montague (1973) of a truth-conditional model-theoretic framework into linguistic theory led to a much tighter integration between theories of formal syntax and a wide range of formal semantic frameworks. Good introductions to Montague semantics and its role in linguistic theory can be found in (Dowty et al., 1981; Partee, 1976).

The representation of events as reified objects is due to Davidson (1967). The approach presented here, which explicitly reifies event participants, is due to Parsons (1990). The use of modal operators and modal logic in the representation of knowledge and belief is due to Hintikka (1969a). Moore (1977) was the first to make computational use of this approach. Fauconnier (1985) deals with a wide range of issues relating to beliefs and belief spaces from a cognitive science perspective. Most current computational approaches to temporal reasoning are based on Allen’s notion of temporal intervals (Allen, 1984). ter Meulen (1995) provides a modern treatment of tense and aspect. Davis (1990) describes the use of FOPC to represent knowledge across a wide range of common sense domains including quantities, space, time, and beliefs.

A recent  comprehensive  treatment  of  logic  and  language  can  be  found 
in  (van  Benthem  and  ter  Meulen,  1997).  The  classic  semantics  text  is  (Lyons, 
1977).  McCawley  (1993)  is  an  indispensable  textbook  covering  a wide  range 
of  topics  concerning  logic  and  language.  Chierchia  and  McConnell-Ginet 
(1991)  also  provides  broad  coverage  of  semantic  issues  from  a linguistic 
perspective.  Heim  and  Kratzer  (1998)  is  a more  recent  text  written  from  the 
perspective  of  current  generative  theory. 


Exercises 

14.1  Choose  a recipe  from  your  favorite  cookbook  and  try  to  make  explicit 
all  the  common-sense  knowledge  that  would  be  needed  to  follow  it. 

14.2  Proponents  of  information  retrieval  occasionally  claim  that  natural 
language  texts  in  their  raw  form  arc  a perfectly  suitable  source  of  knowledge 
for  question  answering.  Sketch  an  argument  against  this  claim. 

14.3 Peruse your daily newspaper for three examples of ambiguous sentences.
Describe the various sources of the ambiguities.

14.4 Consider a domain where the word coffee can refer to the following
concepts in a knowledge-base: a caffeinated or decaffeinated beverage,
ground coffee used to make either kind of beverage, and the beans
themselves. Give arguments as to which of the following uses of coffee are
ambiguous and which are vague.

a.  I’ve  had  my  coffee  for  today. 

b.  Buy  some  coffee  on  your  way  home. 
c. Please grind some more coffee.

14.5  Encode  in  FOPC  as  much  of  the  knowledge  as  you  can  that  you  came 
up with for Exercise 14.1.

14.6  The  following  rule,  which  we  gave  as  a translation  for  Example  14.20, 
is  not  a reasonable  definition  of  what  it  means  to  be  a vegetarian  restaurant. 

∀x VegetarianRestaurant(x) ⇒ Serves(x, VegetarianFood)

Give  a FOPC  rule  that  better  defines  vegetarian  restaurants  in  terms  of  what 
they  serve. 

14.7 Give FOPC translations for the following sentences:

a.  Vegetarians  do  not  eat  meat. 

b.  Not  all  vegetarians  eat  eggs. 

14.8  Give  a set  of  facts  and  inferences  necessary  to  prove  the  following 
assertions: 

a.  McDonalds  is  not  a vegetarian  restaurant. 

b.  Some  vegetarians  can  eat  at  McDonalds. 

Don’t just place these facts in your knowledge-base. Show that they
can be inferred from some more general facts about vegetarians and
McDonalds.

14.9  Give  FOPC  translations  for  the  following  sentences  that  capture  the 
temporal  relationships  between  the  events. 

a.  When  Mary’s  flight  departed,  I ate  lunch. 

b.  When  Mary’s  flight  departed,  I had  eaten  lunch. 

14.10  Give  a reasonable  FOPC  translation  of  the  following  example. 

If you’re interested in baseball, the Rockies are playing tonight.

14.11  On  Page  512  we  gave  the  following  FOPC  translation  for  Example 
14.17. 

Have(Speaker, FiveDollars) ∧ ¬Have(Speaker, LotOfTime)

This literal representation would not be particularly useful to a
restaurant-oriented question answering system. Give a deeper FOPC meaning
representation for this example that is closer to what it really means.

14.12 Describe, in English, the knowledge that would be needed to infer
the deeper representation you produced for the last exercise from the initial
literal representation.

14.13  On  Page  512,  we  gave  the  following  representation  as  a translation 
for  the  sentence  Ay  Caramba  is  near  ICSI. 

Near(LocationOf(AyCaramba), LocationOf(ICSI))

In our truth-conditional semantics, this formula is either true or false given
the contents of some knowledge-base. Critique this truth-conditional
approach with respect to the meaning of words like near.

‘Then you should say what you mean,’ the March Hare went on.
‘I do,’ Alice hastily replied; ‘at least-at least I mean what I say-
that’s the same thing, you know.’

‘Not the same thing a bit!’ said the Hatter. ‘You might just as
well say that “I see what I eat” is the same thing as “I eat what
I see”!’

Lewis  Carroll,  Alice  in  Wonderland 


This chapter presents a number of computational approaches to the
problem of semantic analysis, the process whereby meaning representations
of the kind discussed in the previous chapter are composed and assigned
to linguistic inputs. As we will see in this and later chapters, the creation
of rich and accurate meaning representations necessarily involves a wide
range of knowledge-sources and inference techniques. Among the sources of
knowledge that are typically used are the meanings of words, the meanings
associated with grammatical structures, knowledge about the structure of the
discourse, knowledge about the context in which the discourse is occurring,
and common-sense knowledge about the topic at hand.

The first approach we cover is a kind of syntax-driven semantic analysis
that is fairly limited in its scope. It assigns meaning representations to
inputs based solely on static knowledge from the lexicon and the grammar.
In this approach, when we refer to an input’s meaning, or meaning
representation, we have in mind an impoverished representation that is both
context-independent and inference-free. Meaning representations of this type
correspond to the notion of a literal meaning introduced in the last chapter.

There are two reasons for proceeding along these lines: there are some
limited application domains where such representations are sufficient to
produce useful results, and these impoverished representations can serve as
inputs to subsequent processes that can produce richer, more useful meaning
representations. Chapters 18 and 19 will show how these meaning
representations can be used in processing extended discourses, while
Chapter 21 will show how they can be used in machine translation.

Section 15.5 then presents two alternative approaches to semantic analysis
that are better suited to practical applications. The first approach,
semantic grammars, has been widely applied in the construction of
interactive dialog systems. In this approach, the elements of the grammars are
strongly motivated by the semantic entities and relations of the domain
being discussed. As we will see, the actual algorithms used in this approach
are quite similar to those described in Section 15.1. The difference lies in
the grammars that are used.

The  final  approach,  presented  in  Section  15.5,  addresses  the  task  of 
extracting  small  amounts  of  pertinent  information  from  large  bodies  of  text. 
As  we  will  see,  this  information  extraction  task  does  not  require  the  kind 
of  complete  syntactic  analysis  assumed  in  the  other  approaches.  Instead, 
a series of quite limited, mostly finite-state, automata are combined via a
cascade  to  produce  a robust  semantic  analyzer. 


Syntax-Driven  Semantic  Analysis 

The approach detailed in this section is based on the principle of
compositionality.¹ The key idea underlying this approach is that the meaning
of a sentence can be composed from the meanings of its parts. Of course, when
interpreted superficially, this principle is somewhat less than useful. We know
that sentences are composed of words, and that words are the primary
carriers of meaning in language. It would seem then that all this principle
tells us is that we should compose the meaning representation for sentences
from the meanings of the words that make them up.

Fortunately, the Mad Hatter has provided us with a hint as to how to
make this principle useful. The meaning of a sentence is not based solely on
the words that make it up; it is based on the ordering, grouping, and relations
among the words in the sentence. Of course, this is simply another way
of saying that the meaning of a sentence is partially based on its syntactic
structure. Therefore, in syntax-driven semantic analysis, the composition of
meaning representations is guided by the syntactic components and relations
provided by the kind of grammars discussed in Chapters 9, 11, and 12.

¹ This is normally referred to as Frege’s principle of compositionality. There
appears to be little reason for this ascription, since the principle never
explicitly appears in any of his writings. Indeed, many of his writings can be
taken as supporting a decidedly non-compositional view. Janssen (1997)
discusses this topic in more detail.

We can begin by assuming that the syntactic analysis of an input
sentence will form the input to a semantic analyzer. Figure 15.1 illustrates
the obvious pipeline-oriented approach that follows directly from this
assumption. An input is first passed through a parser to derive its syntactic
analysis. This analysis is then passed as input to a semantic analyzer to
produce a meaning representation. Note that although this diagram shows a
parse tree as input, other syntactic representations such as feature
structures, or lexical dependency diagrams, can be used. The remainder of this
section will assume tree-like inputs.

Before moving on, we should make explicit a major assumption about
the role of ambiguity in this approach. In the syntax-driven approach
presented here, ambiguities arising from the syntax and the lexicon will lead
to the creation of multiple ambiguous meaning representations. It is not the
job of the semantic analyzer, narrowly defined, to resolve these ambiguities.
Instead, it is the job of subsequent interpretation processes with access to
domain-specific knowledge and knowledge of context to select among competing
representations. Of course, we can cut down on the number of ambiguous
representations produced through the use of robust part-of-speech taggers,
prepositional phrase attachment mechanisms, and, as we will see in
Chapter 16, word-sense disambiguation mechanisms.

Let’s  consider  how  such  an  analysis  might  proceed  with  the  following 
example. 

(15.1)  AyCaramba  serves  meat. 

Figure 15.2 shows the simplified parse tree (lacking feature attachments),
along with an appropriate meaning representation for this example. As
suggested by the dashed arrows, a semantic analyzer given this tree as input
might fruitfully proceed by first retrieving a meaning representation from the
subtree corresponding to the verb serves. The analyzer might next retrieve
meaning representations corresponding to the two noun phrases in the
sentence. Then, using the representation acquired from the verb as a template,
the noun phrase meaning representations can be used to bind the appropriate
variables in the verb representation, thus producing the meaning
representation for the sentence as a whole.

[Figure 15.2: Parse tree for the sentence AyCaramba serves meat:
[S [NP [Proper-Noun AyCaramba]] [VP [Verb serves] [NP [Mass-Noun meat]]]].
The S node is paired with the meaning representation
∃e Isa(e, Serving) ∧ Server(e, AyCaramba) ∧ Served(e, Meat).]

Unfortunately,  there  is  a rather  obvious  problem  with  this  simplified 
story.  As  described,  the  function  used  to  interpret  the  tree  in  Figure  15.2 
must  know,  among  other  things,  that  it  is  the  verb  that  carries  the  template 
upon  which  the  final  representation  is  based,  where  this  verb  occurs  in  the 
tree, where its corresponding arguments are, and which argument fills which
role in the verb’s meaning representation. In other words, it requires a good
deal of specific knowledge about this particular example and its parse tree to
create the required meaning representation. Given that there are an infinite
number  of  such  trees  for  any  reasonable  grammar,  any  approach  based  on 
one  semantic  function  for  every  possible  tree  is  in  serious  trouble. 

Fortunately, we have faced this problem before. Languages are not
defined by enumerating the strings or trees that are permitted, but rather by
specifying finite devices that are capable of generating the required set of
outputs. It would seem, therefore, that the right place for semantic
knowledge in a syntax-directed approach is with the finite set of devices that
are used to generate trees in the first place: the grammar rules and the
lexical entries. This is known as the rule-to-rule hypothesis (Bach, 1976).

Designing an analyzer based on this approach brings us back to the
notion of parts and what it means for them to have meanings. The remainder of
this section can be seen as an attempt to answer the following two questions.

• What  does  it  mean  for  syntactic  constituents  to  have  meanings? 

• What do these meanings have to be like so that they can be composed
into larger meanings?

Semantic Augmentations to Context-Free Grammar Rules

In keeping with the approach begun in Chapter 11, we will begin by
augmenting context-free grammar rules with semantic attachments. These
attachments can be thought of as instructions that specify how to compute
the meaning representation of a construction from the meanings of its
constituent parts. Abstractly, our augmented rules have the following structure.

A → α1 ... αn  {f(αj.sem, ..., αk.sem)}

The semantic attachment to the basic context-free rule is shown in the
{...} to the right of the rule’s syntactic constituents. This notation states
that the meaning representation assigned to the construction A, which we will
denote as A.sem, can be computed by running the function f on some subset
of the semantic attachments of A’s constituents.

This characterization of our semantic attachments as a simple function
application is rather abstract. To make this notion more concrete, we
will walk through the semantic attachments necessary to compute the
meaning representation for a series of examples beginning with Example 15.1,
shown earlier in Figure 15.2. We will begin with the more concrete entities
in this example, as specified by the noun phrases, and work our way up to the
more complex expressions representing the meaning of the entire sentence.
The concrete entities in this example are represented by the FOPC constants
AyCaramba and Meat. Our first task is to associate these constants with the
constituents of the tree that introduce them. The first step toward
accomplishing this is to pair them with the lexical rules representing the
words that introduce them into the sentence.

ProperNoun → AyCaramba  {AyCaramba}

MassNoun → meat  {Meat}

These two rules specify that the meanings associated with the subtrees
generated by these rules consist of the constants AyCaramba and Meat.

Note,  however,  that  as  the  arrows  in  Figure  15.2  indicate,  the  subtrees 
corresponding  to  these  rules  do  not  directly  contribute  these  FOPC  constants 
to  the  final  meaning  representation.  Rather,  it  is  the  NPs  higher  in  the  tree 
that  contribute  them  to  the  final  representation.  In  keeping  with  the  principle 
of  compositionality,  we  can  deal  with  this  indirect  contribution  by  stipulating 
that  the  upper  NPs  obtain  their  meaning  representations  from  the  meanings 
of their children. In these two cases, we will assume that the meaning
representations of the children are simply copied upward to the parents.

NP → ProperNoun  {ProperNoun.sem}

NP → MassNoun  {MassNoun.sem}

These rules state that the meaning representations of the noun phrases
are the same as the meaning representations of their individual components,
denoted by ProperNoun.sem and MassNoun.sem. In general, it will be the
case that for non-branching grammar rules, the semantic expression
associated with the child will be copied unchanged to the parent.

Before proceeding, we should point out that there is at least one
potentially confusing aspect to this discussion. While the static semantic
attachment to our first NP rule is simply ProperNoun.sem, the semantic value
of the tree produced by that rule in this example is AyCaramba. It is critical
to distinguish between the semantic attachment of a rule, and the semantic
value associated with a tree generated by a rule. The first is a set of
instructions on how to construct a meaning representation, while the second
consists of the result of following those instructions.
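
The distinction between an attachment and a semantic value can be made
concrete with a minimal Python sketch. The names used here (RULES,
semantic_value, and the tuple encoding of trees) are illustrative assumptions
rather than part of the formalism: each rule's attachment is a function over
the semantic values of its children, and the semantic value of a tree is
whatever that function returns.

```python
# Each grammar rule is keyed by (parent, children) and paired with its
# semantic attachment: a function from the children's semantic values to
# the parent's semantic value. Lexical rules take no arguments.
RULES = {
    ("ProperNoun", ("AyCaramba",)): lambda: "AyCaramba",
    ("MassNoun", ("meat",)): lambda: "Meat",
    # Non-branching rules copy the child's value upward unchanged.
    ("NP", ("ProperNoun",)): lambda proper_noun_sem: proper_noun_sem,
    ("NP", ("MassNoun",)): lambda mass_noun_sem: mass_noun_sem,
}


def label(node):
    """Return the category of a subtree, or the word itself for a leaf."""
    return node if isinstance(node, str) else node[0]


def semantic_value(tree):
    """Follow the attachments bottom-up to compute a tree's semantic value."""
    category, children = tree[0], tree[1:]
    attachment = RULES[(category, tuple(label(c) for c in children))]
    # Words carry no semantics of their own; only subtrees are evaluated.
    child_values = [semantic_value(c) for c in children
                    if not isinstance(c, str)]
    return attachment(*child_values)


np_tree = ("NP", ("ProperNoun", "AyCaramba"))
print(semantic_value(np_tree))  # AyCaramba
```

The attachment stored for the NP rule never mentions AyCaramba; the constant
appears only as the value computed for this particular tree.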

Returning  to  our  example,  having  accounted  for  the  constants  in  the 
representation,  we  can  move  on  to  the  event  underlying  this  utterance  as 
specified by serves. As illustrated in Figure 15.2, a generic Serving event
involves  a Server  and  something  Served,  as  captured  in  the  following  logical 
formula. 

∃e, x, y Isa(e, Serving) ∧ Server(e, x) ∧ Served(e, y)

As  a first  attempt  at  this  verb’s  semantic  attachment,  we  can  simply 
take this logical formula as the semantic attachment for serves, as in the
following.

Verb → serves  {∃e, x, y Isa(e, Serving) ∧ Server(e, x) ∧ Served(e, y)}

Moving up the parse tree, the next constituent to be considered is the
VP that dominates both serves and meat. Unlike the NPs, we cannot simply
copy the meaning of these children up to the parent VP. Rather, we need to
incorporate the meaning of the NP into the meaning of the Verb and assign
the resulting representation to VP.sem. In this case, this consists of
replacing the variable y with the logical term Meat as the second argument of
the Served role of the Serving event. This yields the following meaning
representation, which can be glossed as something like someone serves meat.

∃e, x Isa(e, Serving) ∧ Server(e, x) ∧ Served(e, Meat)

To come up with this representation, the semantic attachment for the
VP must provide a means to replace the quantified variable y within the body
of V.sem with the logical constant Meat, as stipulated by NP.sem. Abstracting
away from this specific example, the VP semantic attachment must have two
capabilities: the means to know exactly which variables within the Verb’s
semantic attachment are to be replaced by the semantics of the Verb’s
arguments, and the ability to perform such a replacement.

Unfortunately,  there  is  no  straightforward  way  to  do  this  given  the 
mechanisms  we  now  have  at  our  disposal.  The  FOPC  formula  we  attached  to 
the  V.sem  does  not  provide  any  advice  about  when  and  how  each  of  its  three 
quantified  variables  should  be  replaced,  and  we  have  no  simple  way,  within 
our  current  specification  of  FOPC,  for  performing  such  a replacement  even  if 
we  did  know. 

Fortunately, there is a notational extension to FOPC called the lambda
notation (Church, 1940) that provides exactly the kind of formal parameter
functionality that we need. This notation extends the syntax of FOPC to
include expressions of the following form.

λx P(x)

Such expressions consist of the Greek symbol λ, followed by one or more
variables, followed by a FOPC expression that makes use of those variables.

The usefulness of these λ-expressions is based on the ability to apply
them to logical terms to yield new FOPC expressions where the formal
parameter variables are bound to the specified terms. This process is known
as λ-reduction and is little more than a simple textual replacement of the
λ variables with the specified FOPC terms, accompanied by the subsequent
removal of the λ. The following expressions illustrate the application of a
λ-expression to the constant A, followed by the result of performing a
λ-reduction on this expression.

λx P(x)(A)

P(A) 

This λ-notation provides both of the capabilities we said were needed in the
Verb semantics: the formal parameter list makes a set of variables within the
body available, and the λ-reduction process implements the desired
replacement of variables with terms.
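
The description of λ-reduction as textual replacement can be illustrated with
a short Python sketch. The class Lam and the function reduce_once are
hypothetical names introduced only for this illustration, and the body of the
expression is held as plain text, which is a simplifying assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Lam:
    """A lambda-expression: one formal parameter and a body written as text."""
    var: str
    body: str


def reduce_once(expr, term):
    """Apply expr to term: substitute term for the parameter, drop the lambda.

    The substitution is purely textual (whole-word matches of the variable
    name), so it assumes variable names never clash with predicate names.
    """
    pattern = rf"\b{re.escape(expr.var)}\b"
    return re.sub(pattern, lambda _match: term, expr.body)


print(reduce_once(Lam("x", "P(x)"), "A"))  # P(A)
```

The parameter name stored in Lam plays the role of the formal parameter list,
and reduce_once plays the role of the reduction step.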

An  important  and  useful  variation  of  this  technique  is  the  use  of  one 
λ-expression as the body of another, as in the following expression.

λx λy Near(x, y)

This fairly abstract expression can be glossed as the state of something
being near something else. The following expressions illustrate a single
λ-application and subsequent reduction with this kind of embedded
λ-expression.

λx λy Near(x, y)(ICSI)

λy Near(ICSI, y)

The important point here is that the resulting expression is still a
λ-expression; the first reduction bound the variable x and removed the outer
λ, thus revealing the inner expression. As might be expected, this resulting
λ-expression can, in turn, be applied to another term to arrive at a fully
specified logical formula, as in the following.

λy Near(ICSI, y)(AyCaramba)

Near(ICSI, AyCaramba)

This technique, called currying² (Schönfinkel, 1924), is a way of
converting a predicate with multiple arguments into a sequence of
single-argument predicates. As we will see shortly, this technique is quite
useful when the arguments to a predicate do not all appear together as
daughters of the predicate in a parse tree.
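
Readers familiar with programming languages may recognize currying from
nested functions. The following Python sketch mirrors the Near example; the
function names are illustrative assumptions, and formulas are built as plain
strings only to keep the example small.

```python
def near(x):
    """Curried Near: takes x and returns a function that still awaits y."""
    def awaiting_y(y):
        return f"Near({x}, {y})"
    return awaiting_y


near_icsi = near("ICSI")        # corresponds to λy Near(ICSI, y)
print(callable(near_icsi))      # True
print(near_icsi("AyCaramba"))   # Near(ICSI, AyCaramba)
```

Because each application consumes exactly one argument, the two arguments can
be supplied at different times, which is what allows them to come from
different places in a parse tree.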

With the λ-notation and the process of λ-reduction, we have the tools
needed to return to the semantic attachments for our VP constituent.
Recall that what was needed was a way to replace the variable representing the
Served role with the meaning representation provided by the NP constituent
of the VP. This can be accomplished in two steps: changing the semantic
attachment of the Verb to a λ-expression, and having the semantic
attachment of the VP apply this expression to the NP semantics. The first of
these steps can be accomplished by designating x, the variable corresponding
to the Served role, as the λ-variable for a λ-expression provided as the
semantic attachment for serves.

Verb → serves  {λx ∃e, y Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}

This attachment makes the variable x externally available to be bound
by an application of this expression to a logical term. The attachment for our
transitive VP rule, therefore, specifies a λ-application where the
λ-expression is provided by Verb.sem and the argument is provided by NP.sem.

VP → Verb NP  {Verb.sem(NP.sem)}

This λ-application results in the replacement, or binding, of x, the
single formal parameter of the λ-expression, with the value contained in
NP.sem. A λ-reduction removes the λ, revealing the inner expression with
the parameter x replaced by the constant Meat. This expression, the
meaning of the verb phrase serves meat, is then the value of VP.sem.

² Currying is the standard term, although Heim and Kratzer (1998) present an
interesting argument for the term Schönfinkelization over currying, since
Curry later built on Schönfinkel’s work.

∃e, y Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, Meat)

To complete this example, we must create the semantic attachment for
the S rule. Like the VP rule, this rule must incorporate an NP argument into
the appropriate role in the event representation now residing in VP.sem. It
should, therefore, consist of another λ-application where the value of VP.sem
provides the λ-expression and the sentence-initial NP.sem provides the final
argument to be incorporated.

S → NP VP  {VP.sem(NP.sem)}

Unfortunately, as it now stands the value of VP.sem doesn’t provide the
necessary λ-expression. The λ-application performed at the VP rule
resulted in a generic FOPC expression with two existentially quantified
variables. The Verb attachment should instead have consisted of an embedded
λ-expression to make the Server role available for binding at the S level of
the grammar. Therefore, our revised representation of the Verb attachment
will be the following.

Verb → serves  {λx λy ∃e Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}

The body of this Verb attachment consists of a λ-expression inside a
λ-expression. The outer expression provides the variable that is replaced by
the first λ-reduction, while the inner expression can be used to bind the final
variable corresponding to the Server role. This ordering of the variables in
the multiple layers of λ-expressions in the semantic attachment of the verb
explicitly encodes facts about the expected location of a Verb’s arguments in
the syntax.

The parse tree for this example, with each node annotated with its
corresponding semantic value, is shown in Figure 15.3.
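
The whole derivation for Example 15.1 can also be traced in a few lines of
Python. This is only a sketch under simplifying assumptions: formulas are
plain strings, λ-expressions are Python functions, and the names verb_serves,
np_rule, vp_rule, and s_rule are invented here to stand for the attachments
given above.

```python
def verb_serves(x):
    """Verb -> serves: λx λy ∃e Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)."""
    def awaiting_server(y):
        return f"∃e Isa(e, Serving) ∧ Server(e, {y}) ∧ Served(e, {x})"
    return awaiting_server


def np_rule(child_sem):
    """NP -> ProperNoun and NP -> MassNoun: copy the child's value upward."""
    return child_sem


def vp_rule(verb_sem, object_sem):
    """VP -> Verb NP: {Verb.sem(NP.sem)}; binds the Served role."""
    return verb_sem(object_sem)


def s_rule(subject_sem, vp_sem):
    """S -> NP VP: {VP.sem(NP.sem)}; binds the Server role."""
    return vp_sem(subject_sem)


subject = np_rule("AyCaramba")            # value of the ProperNoun subtree
obj = np_rule("Meat")                     # value of the MassNoun subtree
vp = vp_rule(verb_serves, obj)            # still a function awaiting the Server
print(s_rule(subject, vp))
# ∃e Isa(e, Serving) ∧ Server(e, AyCaramba) ∧ Served(e, Meat)
```

The order of the nested functions in verb_serves matches the order in which
the parse tree makes the arguments available: the object inside the VP first,
the subject at the S level second.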

This example has served to illustrate several of the most basic
techniques used in this syntax-driven approach to semantic analysis. Section
15.2 will provide a more complete inventory of semantic attachments for
some  of  the  major  English  grammatical  categories.  Before  proceeding  to 
that  inventory,  however,  we  will  first  analyze  several  additional  examples. 
These  examples  will  serve  to  introduce  a few  more  of  the  basic  constructs 
needed  to  make  this  approach  work,  and  will  illustrate  the  general  approach 
to  developing  semantic  attachments  for  a grammar. 

[Figure 15.3: Parse tree with semantic attachments for AyCaramba serves
meat, with AC abbreviating AyCaramba. Node annotations:
S: ∃e Isa(e, Serving) ∧ Server(e, AC) ∧ Served(e, Meat)
NP (subject): AC
ProperNoun: AC
VP: λx ∃e Isa(e, Serving) ∧ Server(e, x) ∧ Served(e, Meat)
NP (object): Meat
Mass-Noun: Meat]


Let’s  consider  the  following  variation  on  Example  15.1. 

(15.2)  A restaurant  serves  meat. 

Since  the  verb  phrase  of  this  example  is  unchanged  from  Example  15.1,  we 
can  restrict  our  attention  to  the  derivation  of  the  semantics  of  the  subject 
noun  phrase  and  its  subsequent  integration  with  the  verb  phrase  in  the  S rule. 
As a starting point, let’s assume that the following formula is a plausible
representation  for  the  meaning  of  the  subject  in  this  example. 

∃x Isa(x, Restaurant)

Combining this new representation with the one already developed for the
verb phrase yields the following meaning representation.

∃e, x Isa(e, Serving) ∧ Server(e, x) ∧ Served(e, Meat) ∧ Isa(x, Restaurant)

In  this  formula,  the  restaurant,  represented  by  the  variable  x,  is  specified  as 
playing  the  role  of  the  Server  by  its  presence  as  the  second  argument  to  the 
Server  predicate. 

Unfortunately, the λ-application specified as the semantic attachment
for the S rule will not produce this result. A literal interpretation of
λ-reduction as a textual replacement results in the following expression, where
the entire meaning representation of the noun phrase is embedded in the
Server predicate.

∃e Isa(e, Serving) ∧ Server(e, ∃x Isa(x, Restaurant)) ∧ Served(e, Meat)

Although this expression has a certain intuitive appeal, it is not a valid
FOPC formula. Expressions like the one denoting our restaurant cannot
appear as arguments to predicates; such arguments are limited to FOPC terms.
In fact, since by definition λ-expressions can only be applied to FOPC terms,
the application of the λ-expression attached to the VP to the semantics of the
subject was ill-formed to begin with.

We can solve this problem in a manner similar to the way that
λ-expressions were used to solve the verb phrase and S semantic attachment
problems: by adding a new notation to the existing FOPC syntax that
facilitates the compositional creation of the desired meaning representation.
In this case, we will introduce the notion of a complex-term that allows FOPC
expressions like ∃x Isa(x, Restaurant) to appear in places where normally
only ordinary FOPC terms would appear. Formally, a complex-term is an
expression with the following three-part structure.

<Quantifier variable body>

Applying this notation to our current example, we arrive at the
following representation.

∃e Isa(e, Serving) ∧ Server(e, <∃x Isa(x, Restaurant)>) ∧ Served(e, Meat)

As was the case with λ-expressions, this notational change will only
be  useful  if  we  can  provide  a straightforward  way  to  convert  it  into  ordinary 
FOPC  syntax.  This  can  be  accomplished  by  rewriting  any  predicate  using  a 
complex-term  according  to  the  following  schema. 

P(<Quantifier variable body>)

⇒

Quantifier variable body Connective P(variable)

In  other  words,  the  complex-term: 

1. Is extracted from the predicate in which it appears,

2. Is replaced by the variable that represents the object in question,

3. And has its quantifier, variable, and body prepended to the new expression
through the use of an appropriate connective.

The following pair of expressions illustrates this complex-term
reduction on our current example.

Server(e, <∃x Isa(x, Restaurant)>)

⇒

∃x Isa(x, Restaurant) ∧ Server(e, x)

The connective that is used to attach the extracted formula to the front of the
new expression depends on the type of the quantifier being used: ∧ is used
with ∃, and ⇒ is used with ∀.
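
The rewriting schema for complex-terms can be sketched in Python as follows.
The class ComplexTerm, the table CONNECTIVE, and the function
reduce_complex_term are names assumed for this illustration; the sketch
handles a predicate containing a single complex-term and leaves aside
questions of quantifier scope that arise when several are present.

```python
from dataclasses import dataclass

# The connective depends on the quantifier: ∧ with ∃, and ⇒ with ∀.
CONNECTIVE = {"∃": "∧", "∀": "⇒"}


@dataclass(frozen=True)
class ComplexTerm:
    """The three-part structure <Quantifier variable body>."""
    quantifier: str
    variable: str
    body: str


def reduce_complex_term(predicate, args):
    """Rewrite P(..., <Q v body>, ...) as: Q v body Connective P(..., v, ...)."""
    prefix = ""
    plain_args = []
    for arg in args:
        if isinstance(arg, ComplexTerm):
            # Extract the complex-term and leave its variable in its place.
            connective = CONNECTIVE[arg.quantifier]
            prefix = f"{arg.quantifier}{arg.variable} {arg.body} {connective} "
            plain_args.append(arg.variable)
        else:
            plain_args.append(arg)
    return f"{prefix}{predicate}({', '.join(plain_args)})"


restaurant = ComplexTerm("∃", "x", "Isa(x, Restaurant)")
print(reduce_complex_term("Server", ["e", restaurant]))
# ∃x Isa(x, Restaurant) ∧ Server(e, x)
```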
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It  will  also  be  useful  to  be  able  to  access  the  three  components  of 
complex-terms.  We  will,  therefore,  extend  the  syntax  used  to  refer  to  the 
semantics  of  a constituent  by  allowing  reference  to  its  parts.  For  exam- 
ple, if  A.sem  is  a complex-term  then  A. sem. quantifier,  A. sem. variable,  and 
A.sem.body  retrieve  the  complex-term's  quantifier,  variable,  and  body,  re- 
spectively. 
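The reduction schema and the component accessors can be illustrated with a
minimal Python sketch. Everything in it is an assumption made for illustration:
the names ComplexTerm and reduce_complex_term are hypothetical, formulas are
kept as plain strings, and the ASCII spellings "exists", "forall", "&", and
"=>" stand in for ∃, ∀, ∧, and ⇒. It handles at most one complex-term per
predicate.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class ComplexTerm:
    """A three-part term <Quantifier variable body> (hypothetical encoding)."""
    quantifier: str  # "exists" or "forall"
    variable: str    # e.g. "x"
    body: str        # formula over `variable`, kept as a plain string


# Connective used to attach the extracted formula, chosen by quantifier.
CONNECTIVE = {"exists": "&", "forall": "=>"}


def reduce_complex_term(predicate: str, args: List[Union[str, ComplexTerm]]) -> str:
    """Rewrite P(.., <Q v body>, ..) as 'Q v body Connective P(.., v, ..)'.

    Assumes at most one complex-term among the arguments.
    """
    prefix = ""
    plain_args = []
    for arg in args:
        if isinstance(arg, ComplexTerm):
            # Steps 1 and 3: extract the term and prepend it with its connective.
            prefix = f"{arg.quantifier} {arg.variable} {arg.body} {CONNECTIVE[arg.quantifier]} "
            # Step 2: leave only the variable behind in the predicate.
            plain_args.append(arg.variable)
        else:
            plain_args.append(arg)
    return f"{prefix}{predicate}({', '.join(plain_args)})"


a_restaurant = ComplexTerm("exists", "x", "Isa(x, Restaurant)")
print(a_restaurant.quantifier, a_restaurant.variable, a_restaurant.body)
print(reduce_complex_term("Server", ["e", a_restaurant]))
```

The attribute access on a_restaurant plays the role of the A.sem.quantifier,
A.sem.variable, and A.sem.body notation.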

Returning  to  Example  15.2,  we  can  now  address  the  creation  of  the 
target  meaning  representation  for  the  phrase  a restaurant.  Given  the  simple 
syntactic  structure  of  this  noun  phrase,  the  job  of  the  NP  semantic  attachment 
is  fairly  straightforward. 

NP → Det Nominal {<Det.sem x Nominal.sem(x)>}

This attachment creates a complex-term consisting of a quantifier retrieved
from the Det, followed by an arbitrary variable, and then an application of the
λ-expression associated with the Nominal to that variable. This λ-application
ensures that the correct variable appears within the predicate specified by the
Nominal.

The attachment for the determiner simply specifies the quantifier to be used.

Det → a {∃}

The job of the nominal category is to create the Isa formula and
λ-expression needed for use in the noun phrase.

Nominal → Noun {λx Isa(x, Noun.sem)}

Finally, the noun attachment simply provides the name of the category
being discussed.

Noun → restaurant {Restaurant}
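The four attachments for a restaurant can be traced bottom-up in a short
Python sketch. This is only one possible encoding, not the book's
implementation: λ-expressions are modeled as Python functions that return
formula strings, a complex-term is modeled as a dictionary, and the function
names are hypothetical. The string "exists" stands in for ∃.

```python
# Lexical attachments.
noun_sem = "Restaurant"   # Noun -> restaurant {Restaurant}
det_sem = "exists"        # Det  -> a {exists}


def nominal_attachment(noun_sem):
    """Nominal -> Noun {lambda x. Isa(x, Noun.sem)}"""
    return lambda x: f"Isa({x}, {noun_sem})"


def np_attachment(det_sem, nominal_sem, variable="x"):
    """NP -> Det Nominal {<Det.sem x Nominal.sem(x)>}

    The variable name is arbitrary; applying the nominal's lambda-expression
    to it puts the same variable inside the Isa predicate.
    """
    return {
        "quantifier": det_sem,
        "variable": variable,
        "body": nominal_sem(variable),
    }


np_sem = np_attachment(det_sem, nominal_attachment(noun_sem))
print(np_sem)
```

Note that only the lexical rules contribute content (Restaurant and the
quantifier); the two grammar-rule attachments merely assemble it.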

In walking through this example, we have introduced five concrete
mechanisms that instantiate the abstract functional characterization of
semantic attachments that began this section.

• The association of normal FOPC expressions with lexical items.

• The association of function-like λ-expressions with lexical items.

• The copying of semantic values from children to parents.

• The function-like application of λ-expressions to the semantics of one
  or more children of a constituent.

• The use of complex-terms to allow quantified expressions to be
  temporarily treated as terms.

The introduction of λ-expressions and complex-terms was motivated
by the gap between the syntax of FOPC and the syntax of English. These
extra-logical devices serve to bring the syntax of FOPC closer to the
syntax of the language being processed, thus facilitating the semantic analysis
process. Meaning representations that make use of these kinds of devices
are usually referred to as quasi-logical forms or intermediate
representations. Note that there is a subtle difference in usage between these
two terms. The term quasi-logical form is usually applied to representations
that can easily be converted to a logical representation via some simple
syntactic transformation. The term intermediate representation is normally
used to refer to meaning representations that serve as input to further
analysis processes in an attempt to produce deeper meaning representations.

For the purposes of this chapter, our meaning representations are
quasi-logical forms since they can easily be converted to FOPC. From a somewhat
broader perspective, they are also intermediate forms since further
interpretation is certainly needed to get them closer to reasonable meaning
representations.

The few rules introduced in this section also serve to illustrate a
principle that guides the design of semantic attachments in the compositional
framework. In general, it is the lexical rules that provide content-level
predicates and terms for our meaning representations. The semantic attachments
to grammar rules put these predicates and terms together in the right ways,
but do not in general introduce predicates and terms into the representation
being created.

Quantifier  Scoping  and  the  Translation  of  Complex  Terms 

The  schema  given  above  to  translate  expressions  containing  complex  terms 
into  FOPC  expressions  is,  unfortunately,  not  unique.  Consider  the  following 
example,  along  with  its  original  unscoped  meaning  representation. 

(15.3)  Every  restaurant  has  a menu. 

∃e Isa(e, Having)
    ∧ Haver(e, <∀x Isa(x, Restaurant)>)
    ∧ Had(e, <∃y Isa(y, Menu)>)

If the complex-terms filling the Haver and the Had roles are rewritten
so that the quantifier for the Haver role has the outer scope, then the result
is the following meaning representation, which corresponds to the
common-sense interpretation of this sentence.

∀x Restaurant(x) ⇒
    ∃e,y Having(e) ∧ Haver(e, x) ∧ Isa(y, Menu) ∧ Had(e, y)

On  the  other  hand,  if  the  terms  are  rewritten  in  the  reverse  order,  then 
the  following  FOPC  representation  results,  which  states  that  there  is  one 
menu  that  all  restaurants  share. 

∃y Isa(y, Menu) ∧ ∀x Isa(x, Restaurant) ⇒
    ∃e Having(e) ∧ Haver(e, x) ∧ Had(e, y)

This example illustrates the problem of ambiguous quantifier scoping:
a single logical formula with two complex terms gives rise to two distinct
and incompatible FOPC representations. In the worst case, sentences with N
quantifiers will have O(N!) different possible quantifier scopings.
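The factorial growth can be made concrete with a small Python sketch that
enumerates every ordering of the extracted complex-terms. It is an
illustration under stated assumptions: the function name scopings is
hypothetical, formulas are plain strings, ASCII "exists", "forall", "&", and
"=>" replace the logical symbols, and explicit parentheses mark scope. The
event quantifier is simply left innermost.

```python
from itertools import permutations

# Connective used to attach each extracted term, chosen by quantifier.
CONNECTIVE = {"exists": "&", "forall": "=>"}


def scopings(complex_terms, core):
    """Yield one formula per ordering of the complex-terms.

    The first term in each ordering receives the outermost scope.
    """
    for order in permutations(complex_terms):
        formula = core
        # Wrap from the innermost quantifier outward.
        for quantifier, variable, body in reversed(order):
            formula = f"{quantifier} {variable} {body} {CONNECTIVE[quantifier]} ({formula})"
        yield formula


# Example 15.3, with the two complex-terms already pulled out of their roles.
terms = [
    ("forall", "x", "Isa(x, Restaurant)"),
    ("exists", "y", "Isa(y, Menu)"),
]
core = "exists e Isa(e, Having) & Haver(e, x) & Had(e, y)"

for reading in scopings(terms, core):
    print(reading)
```

With two terms the loop prints two readings, one with the universal outermost
and one with the existential outermost; with N terms it produces N! readings,
which is why practical systems rank or prune scopings rather than enumerate
them blindly.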

In practice, most systems employ an ad hoc set of heuristic preference
rules that can be used to generate preferred forms in order of their overall
likelihood. In cases where no preference rules apply, a left-to-right quantifier
ordering that mirrors the surface order of the quantifiers is used.
Domain-specific knowledge can then be used to either accept a quantified
formula, or reject it and request another formula. Alshawi (1992) presents a
comprehensive approach to generating plausible quantifier scopings.


15.2  Attachments  for  a Fragment  of  English 

This section describes a set of semantic attachments for a small fragment
of English. As in the rest of this chapter, to keep the presentation simple,
we omit the feature structures associated with these rules when they are not
needed. Remember that these features are needed to ensure that the correct
rules are applied in the correct situations. Most importantly for this
discussion, they are needed to ensure that the correct verb entries are being
employed based on their subcategorization feature structures.

Sentences 

For the most part, our semantic discussions have only dealt with declarative
sentences. This section expands our coverage to include the other sentence
types first introduced in Chapter 9: imperatives, yes-no questions, and
wh-questions. Let's start by considering the following examples.

(15.4) Flight 487 serves lunch.

(15.5) Serve lunch.

(15.6) Does Flight 207 serve lunch?

(15.7) Which flights serve lunch?

The meaning representations of these examples all contain propositions
concerning the serving of lunch on flights. However, they differ with
respect to the role that these propositions are intended to serve in the settings
in which they are uttered. More specifically, the first example is intended to
convey factual information to a hearer, the second is a request for an action,
and the last two are requests for information. To capture these differences,
we will introduce a set of operators that can be applied to FOPC sentences
in the same way that belief operators were used in Chapter 14. Specifically,
the operators DCL, IMP, YNQ, and WHQ will be applied to the FOPC
representations of declaratives, imperatives, yes-no questions, and
wh-questions, respectively.

Producing meaning representations that make appropriate use of these
operators requires the right set of semantic attachments for each of the
possible sentence types. For declarative sentences, we can simply alter the
basic sentence rule we have been using as follows.

S → NP VP {DCL(VP.sem(NP.sem))}

The  normal  interpretation  for  a representation  headed  by  the  DCL  operator 
would  be  as  a factual  statement  to  be  added  to  the  current  knowledge-base. 

Imperative sentences begin with a verb phrase and lack an overt subject.
Because of the missing subject, the meaning representation for the main
verb phrase will consist of a λ-expression with an unbound λ-variable
representing this missing subject. To deal with this, we can simply supply a
subject to the λ-expression by applying a final λ-reduction to a dummy
constant. The IMP operator can then be applied to this representation as in the
following semantic attachment.

S → VP {IMP(VP.sem(DummyYou))}

Applying this rule to Example 15.5 results in the following representation.

IMP(∃e Serving(e) ∧ Server(e, DummyYou) ∧ Served(e, Lunch))

As will be discussed in Chapter 19, imperatives can be viewed as a kind of
speech act: actions that are performed by virtue of being uttered.

As discussed in Chapter 9, yes-no-questions consist of a sentence-initial
auxiliary verb, followed by a subject noun phrase and then a verb
phrase. The following semantic attachment simply ignores the auxiliary, and
with the exception of the YNQ operator, constructs the same representation
that would be created for the corresponding declarative sentence.

S → Aux NP VP {YNQ(VP.sem(NP.sem))}

The use of this rule for Example 15.6 produces the following representation.

YNQ(∃e Serving(e) ∧ Server(e, Flt207) ∧ Served(e, Lunch))

Yes-no-questions should be thought of as asking whether the propositional
part of their meaning is true or false given the knowledge currently
contained in the knowledge-base. Adopting the kind of semantics described in
Chapter 14, yes-no-questions can be answered by determining if the
proposition is in the knowledge-base, or if it can be inferred from the
knowledge-base.
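The three sentence-level attachments introduced so far differ only in the
operator they wrap around the proposition. The following Python sketch shows
this under a few assumptions: a verb phrase meaning is modeled as a Python
function from a subject term to a formula string, the function names are
hypothetical, and "exists" and "&" stand in for ∃ and ∧.

```python
# VP.sem for "serve(s) lunch": a lambda-expression awaiting its subject.
serve_lunch = lambda x: f"exists e Serving(e) & Server(e, {x}) & Served(e, Lunch)"


def declarative(np_sem, vp_sem):
    """S -> NP VP {DCL(VP.sem(NP.sem))}"""
    return f"DCL({vp_sem(np_sem)})"


def imperative(vp_sem):
    """S -> VP {IMP(VP.sem(DummyYou))}: a dummy constant fills the subject."""
    return f"IMP({vp_sem('DummyYou')})"


def yes_no_question(aux, np_sem, vp_sem):
    """S -> Aux NP VP {YNQ(VP.sem(NP.sem))}: the auxiliary is ignored."""
    return f"YNQ({vp_sem(np_sem)})"


print(declarative("Flt487", serve_lunch))
print(imperative(serve_lunch))
print(yes_no_question("does", "Flt207", serve_lunch))
```

The aux parameter is accepted only to mirror the shape of the grammar rule;
it plays no part in the result.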

Unlike yes-no-questions, wh-subject-questions ask for specific
information about the subject of the sentence rather than the sentence as a
whole. The following attachment produces a representation that consists of the
operator WHQ, the variable corresponding to the subject of the sentence, and
the body of the proposition.

S → WhWord NP VP {WHQ(NP.sem.var, VP.sem(NP.sem))}

The following representation is the result of applying this rule to
Example 15.7.

WHQ(x, ∃e,x Isa(e, Serving) ∧ Server(e, x)
    ∧ Served(e, Lunch) ∧ Isa(x, Flight))

Such  questions  can  be  answered  by  returning  a set  of  assignments  for  the 
subject  variable  that  make  the  resulting  proposition  true  with  respect  to  the 
current  knowledge-base. 
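How such questions might be answered can be sketched with a toy
knowledge-base in Python. This is a simplification under explicit assumptions:
facts are stored as (predicate, argument, argument) tuples, answering is pure
lookup with no inference, the helper names are hypothetical, and the event
identifiers e1 to e3 and the flight Flt100 are invented for illustration.

```python
KB = {
    ("Isa", "e1", "Serving"), ("Server", "e1", "Flt207"), ("Served", "e1", "Lunch"),
    ("Isa", "e2", "Serving"), ("Server", "e2", "Flt487"), ("Served", "e2", "Lunch"),
    ("Isa", "e3", "Serving"), ("Server", "e3", "Flt100"), ("Served", "e3", "Dinner"),
    ("Isa", "Flt207", "Flight"), ("Isa", "Flt487", "Flight"), ("Isa", "Flt100", "Flight"),
}


def serves(kb, flight, meal):
    """True if some event e has Isa(e, Serving), Server(e, flight), Served(e, meal)."""
    events = {e for (p, e, c) in kb if p == "Isa" and c == "Serving"}
    return any(("Server", e, flight) in kb and ("Served", e, meal) in kb
               for e in events)


def answer_ynq(kb, flight, meal):
    """YNQ: is the proposition true with respect to the knowledge-base?"""
    return serves(kb, flight, meal)


def answer_whq(kb, meal):
    """WHQ: return every binding of the subject variable that makes it true."""
    flights = {x for (p, x, c) in kb if p == "Isa" and c == "Flight"}
    return sorted(x for x in flights if serves(kb, x, meal))


print(answer_ynq(KB, "Flt207", "Lunch"))
print(answer_whq(KB, "Lunch"))
```

A yes-no question yields a truth value, while a wh-subject-question yields the
set of assignments for the subject variable, here the flights that serve lunch.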

Finally,  consider  the  following  wh-non-subject-question. 

(15.8)  How  can  I go  from  Minneapolis  to  Long  Beach? 

In examples like this, the question is not about the subject of the sentence but
rather some other argument, or some aspect of the proposition as a whole.
In this case, the representation needs to provide an indication as to what the
question is about. The following attachment provides this information by
providing the semantics of the wh-word as an argument to the WHQ operator.

S → WhWord Aux NP VP {WHQ(WhWord.sem, VP.sem(NP.sem))}


The  following  representation  would  result  from  an  application  of  this 
rule  to  Example  15.8. 

WHQ(How, ∃e Isa(e, Going) ∧ Goer(e, User)
    ∧ Origin(e, Minn) ∧ Destination(e, LongBeach))

As we will discuss in Section 15.5 and Chapter 19, correctly answering this
kind of question involves a fair amount of domain-specific reasoning. For
example, the correct way to answer Example 15.8 is to search for flights with
the specified departure and arrival cities. Note, however, that there is no
mention of flights or flying in the actual question. The question-answerer
therefore has to apply knowledge specific to this domain to the effect that
questions about going places are really questions about flights to those places.

Finally, we should make it clear that this particular attachment is only
useful for rather simple wh-questions without missing arguments or embedded
clauses. As discussed in Chapter 11, the presence of long-distance
dependencies in these questions requires additional mechanisms to determine
exactly what is being asked about. Woods (1977) and Alshawi (1992)
provide extensive discussions of general mechanisms for handling
wh-non-subject questions. Section 15.5 presents a more ad hoc approach that is
often used in practical systems.

Noun  Phrases 

As we have already seen, the meaning representations for noun phrases can
be either normal FOPC terms or complex-terms. The following sections detail
the semantic attachments needed to produce meaning representations for
some of the most frequent kinds of English noun phrases. Unfortunately, as
we will see, the syntax of English noun phrases provides surprisingly little
insight into their meaning. It is often the case that the best we can do is
provide a rather vague intermediate level of meaning representation that can
serve as input to further interpretation processes.

Compound  Nominals 

Compound  nominals,  also  known  as  noun-noun  sequences,  consist  of  simple 
sequences  of  nouns,  as  in  the  following  examples. 

(15.9)  Flight  schedule 

(15.10) Summer flight schedule

As noted in Chapter 9, the syntactic structure of this construction can be
captured by the regular expression Noun*, or by the following context-free
grammar rules.

Nominal → Noun

Nominal → Noun Nominal

In these constructions, the final noun in the sequence is the head of the
phrase and denotes an object that is semantically related in some unspecified
way to the other nouns that precede it in the sequence. In general, an
extremely wide range of common-sense relations can be denoted by this
construction. Discerning the exact nature of these relationships is well beyond
the scope of the kind of superficial semantic analysis presented in this
chapter. The attachment in the following rule builds up a vague representation
that simply notes the existence of a semantic relation between the head noun
and the modifying nouns, by incrementally noting such a relation between
the head noun and each noun to its left.

Nominal → Noun Nominal
    {λx Nominal.sem(x) ∧ NN(Noun.sem, x)}

The relation NN is used to specify that a relation holds between the
modifying elements of a compound nominal and the head Noun. In the
examples given above, this leads to the following meaning representations.

λx Isa(x, Schedule) ∧ NN(x, Flight)

λx Isa(x, Schedule) ∧ NN(x, Flight) ∧ NN(x, Summer)

Note  that  this  representation  correctly  instantiates  a term  representing 
a Schedule,  while  avoiding  the  creation  of  terms  representing  either  a Flight 
or  Summer. 
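The incremental construction can be sketched in Python as follows. The sketch
assumes a right-branching analysis in which the last noun is the head, models
λ-expressions as Python functions returning formula strings, and uses
hypothetical function names. The argument order of NN follows the example
representations, NN(x, Modifier), which is a choice made here rather than
something the text settles.

```python
def noun_nominal(noun_sem):
    """Nominal -> Noun {lambda x. Isa(x, Noun.sem)}"""
    return lambda x: f"Isa({x}, {noun_sem})"


def compound_nominal(noun_sem, nominal_sem):
    """Nominal -> Noun Nominal: add one NN relation for the modifying noun."""
    return lambda x: f"{nominal_sem(x)} & NN({x}, {noun_sem})"


def analyze(nouns):
    """Build the meaning of a noun-noun sequence, innermost (head) first."""
    sem = noun_nominal(nouns[-1])
    # Attach modifiers from the one nearest the head outward to the left.
    for modifier in reversed(nouns[:-1]):
        sem = compound_nominal(modifier, sem)
    return sem


print(analyze(["Flight", "Schedule"])("x"))
print(analyze(["Summer", "Flight", "Schedule"])("x"))
```

Only the head noun contributes an Isa predicate, so no term is created for a
Flight or for Summer; the modifiers appear solely inside NN relations.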

Genitive  Noun  Phrases 

Recall from Chapter 9 that genitive noun phrases make use of complex
determiners that consist of noun phrases with possessive markers, as in
Atlanta's airport and Maharani's menu. It is quite tempting to represent the
relation between these words as an abstract kind of possession. A little
introspection, however, reveals that the relation between a city and its
airport has little in common with a restaurant and its menu. Therefore, as with
compound nominals, it turns out to be best to simply state an abstract semantic
relation between the various constituents.

NP → ComplexDet Nominal
    {<∃x Nominal.sem(x) ∧ GN(x, ComplexDet.sem)>}

ComplexDet → NP 's {NP.sem}

Applying these rules to Atlanta's airport results in the following
complex term.

<∃x Isa(x, Airport) ∧ GN(x, Atlanta)>

Subsequent semantic interpretation would have to determine that the relation
denoted by GN is actually a location.

Adjective  Phrases 

English adjectives can be split into two major categories: pre-nominal and
predicate. These categories are exemplified by the following BERP examples.

(15.11) I don't mind a cheap restaurant.

(15.12)  This  restaurant  is  cheap. 

For  the  pre-nominal  case,  an  obvious  and  often  incorrect  proposal  for 
the  semantic  attachment  is  illustrated  in  the  following  rules. 

Nominal → Adj Nominal
    {λx Nominal.sem(x) ∧ Isa(x, Adj.sem)}

Adj → cheap {Cheap}

This solution modifies the semantics of the nominal by applying the
predicate provided by the adjective to the variable representing the nominal.
For our cheap restaurant example, this yields the following fairly reasonable
representation.

λx Isa(x, Restaurant) ∧ Isa(x, Cheap)

This is an example of what is known as intersective semantics since
the meaning of the phrase can be thought of as the intersection of the
category stipulated by the nominal and the category stipulated by the adjective.
In this case, this amounts to the intersection of the category of cheap things
with the category of restaurants.
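Read extensionally, intersective semantics is ordinary set intersection, as
the following small Python sketch suggests. The category members listed here
are invented purely for illustration, and the function name is hypothetical.

```python
# Hypothetical category extensions, invented for illustration only.
restaurants = {"Maharani", "RestaurantB"}
cheap_things = {"Maharani", "PaperClip"}


def intersective(nominal_extension, adjective_extension):
    """Members of the phrase's category are members of both categories."""
    return nominal_extension & adjective_extension


print(intersective(restaurants, cheap_things))
```

The same operation applied to a category of guns and a category of fake
things would return only objects that really are guns, which is exactly the
wrong result for a phrase like fake gun.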

Unfortunately, this solution often does the wrong thing. For example,
consider the following meaning representations for the phrases small
elephant, former friend, and fake gun.

λx Isa(x, Elephant) ∧ Isa(x, Small)

λx Isa(x, Friend) ∧ Isa(x, Former)

λx Isa(x, Gun) ∧ Isa(x, Fake)


Each of these representations is peculiar in some way. The first one states
that this particular elephant is a member of the general category of small
things, which is probably not true. The second example is strange in two
ways: it asserts that the person in question is a friend, which is false, and it
makes use of a fairly unreasonable category of former things. Similarly, the
third example asserts that the object in question is a gun despite the fact that
fake means it is not one.

As with compound nominals, there is no clever solution to these
problems within the bounds of our current compositional framework. Therefore,
the best approach is to simply note the status of a specific kind of
modification relation and assume that some further procedure with access to
additional relevant knowledge can replace this vague relation with an
appropriate representation (Alshawi, 1992).

Nominal → Adj Nominal
    {λx Nominal.sem(x) ∧ AM(x, Adj.sem)}

Applying this rule to a cheap restaurant results in the following formula.

∃x Isa(x, Restaurant) ∧ AM(x, Cheap)

Note that even this watered-down proposal produces representations
that are logically incorrect for the fake and former examples. In both cases,
it asserts that the objects in question are in fact members of their stated
categories. In general, the solution to this problem has to be based on the
specific semantics of the adjectives and nouns in question. For example, the
semantics of former has to involve some form of temporal reasoning, while fake
requires the ability to reason about the nature of concepts and categories.

Verb  Phrases 

The general schema for computing the semantics of verb phrases relies on
the notion of function application. In most cases, the λ-expression attached
to the verb is simply applied to the semantic attachments of the verb's
arguments. There are, however, a number of situations that force us to depart
somewhat from this general pattern.

Infinitive  Verb  Phrases 

A fair number of English verbs take some form of verb phrase as one of their
arguments. This complicates the normal verb phrase semantic schema since
these argument verb phrases interact with the other arguments of the
head verb in ways that are not completely obvious.
Consider the following example.

(15.13) I told Harry to go to Maharani.

The meaning representation for this example should be something like the
following.

∃e,f,x Isa(e, Telling) ∧ Isa(f, Going)
    ∧ Teller(e, Speaker) ∧ Tellee(e, Harry) ∧ ToldThing(e, f)
    ∧ Goer(f, Harry) ∧ Destination(f, x)

There are two interesting things to note about this meaning
representation: the first is that it consists of two events, and the second is
that one of the participants, Harry, plays a role in both of the two events.
The difficulty in creating this complex representation falls to the verb phrase
dominating the verb tell, which will have something like the following as its
semantic attachment.

λx,y λz ∃e Isa(e, Telling)
    ∧ Teller(e, z) ∧ Tellee(e, x) ∧ ToldThing(e, y)

Semantically, we can interpret this subcategorization frame for Tell as
providing three semantic roles: a person doing the telling, a recipient of the
telling, and the proposition being conveyed.

The difficult part of this example involves getting the meaning
representation for the main verb phrase correct. As shown in Figure 15.2, Harry
plays the role of both the Tellee of the Telling event and the Goer of the
Going event. However, Harry is not available when the Going event is
created within the infinitive verb phrase.

Although there are several possible solutions to this problem, it is
usually best to stick with a uniform approach to these problems. Therefore, we
will start by simply applying the semantics of the verb to the semantics of
the other arguments of the verb as follows.

VP → Verb NP VPto {Verb.sem(NP.sem, VPto.sem)}

Since the to in the infinitive verb phrase construction does not
contribute to its meaning, we simply copy the meaning of the child verb phrase
up to the infinitive verb phrase. Recall that we are relying on the unseen
feature structures to ensure that only the correct verb phrases can occur with
this construction.

VPto → to VP {VP.sem}

In this solution, the verb's semantic attachment has two tasks:
incorporating the NP.sem, the Goer, into the VPto.sem, and incorporating the
Going event as the ToldThing of the Telling. The following attachment performs
both tasks.

Verb → tell
    {λx,y
     λz
     ∃e, y.variable Isa(e, Telling)
         ∧ Teller(e, z) ∧ Tellee(e, x)
         ∧ ToldThing(e, y.variable) ∧ y(x)}

In this approach, the λ-variable x plays the role of the Tellee of the telling
and the argument to the semantics of the infinitive, which is now contained as
a λ-expression in the variable y. The expression y(x) represents a λ-reduction
that inserts Harry into the Going event as the Goer. The notation y.variable
is analogous to the notation used for complex-term variables, and gives us
access to the event variable representing the Going event within the
infinitive's meaning representation.

Note that this approach plays fast and loose with the definition of
λ-reduction, in that it allows λ-expressions to be passed as arguments to other
λ-expressions, when technically only FOPC terms can serve that role. This
technique is a convenience similar to the use of complex terms in that it
allows us to temporarily treat complex expressions as terms during the creation
of meaning representations.
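The attachment for tell can be sketched in Python, where passing one function
to another is unproblematic. The sketch rests on several assumptions: the
class VPSem and the function tell_sem are hypothetical names, a verb phrase
meaning carries its event variable alongside a function over the missing
subject, formulas are plain strings with "exists" and "&" for ∃ and ∧, and the
destination is written as the constant Maharani rather than a quantified
variable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VPSem:
    """Meaning of a verb phrase: an event variable plus a lambda over the subject."""
    variable: str                # event variable, e.g. "f"
    body: Callable[[str], str]   # awaits the missing subject

    def __call__(self, subject: str) -> str:
        return self.body(subject)


# VPto.sem for "to go to Maharani" (copied up unchanged from the child VP).
go_to_maharani = VPSem(
    "f", lambda x: f"Isa(f, Going) & Goer(f, {x}) & Destination(f, Maharani)"
)


def tell_sem(x: str, y: VPSem):
    """Verb -> tell: x is the Tellee, y the infinitive's lambda-expression."""
    return lambda z: (
        f"exists e, {y.variable} Isa(e, Telling) & Teller(e, {z}) & Tellee(e, {x}) "
        f"& ToldThing(e, {y.variable}) & {y(x)}"
    )


# VP -> Verb NP VPto {Verb.sem(NP.sem, VPto.sem)}
told_harry_to_go = tell_sem("Harry", go_to_maharani)
print(told_harry_to_go("Speaker"))
```

The call y(x) is what inserts Harry as the Goer, and y.variable is what lets
the Going event fill the ToldThing role.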


Prepositional Phrases

At a fairly abstract level, prepositional phrases serve two distinct functions:
they assert binary relations between their heads and the constituents to which
they are attached, and they signal arguments to constituents that have an
argument structure. These two functions argue for two distinct types of
prepositional phrases that differ based on their semantic attachments. We will
consider three places in the grammar where prepositional phrases serve these
roles: modifiers of noun phrases, modifiers of verb phrases, and arguments
to verb phrases.

Nominal  Modifier  Prepositional  Phrases 

Modifier  prepositional  phrases  denote  a binary  relation  between  the  concept 
being  modified,  which  is  external  to  the  prepositional  phrase,  and  the  head  of 
the  prepositional  phrase.  Consider  the  following  example  and  its  associated 
meaning  representation. 

(1)  A restaurant  on  Pearl 

∃x Isa(x, Restaurant) ∧ On(x, Pearl)

The relevant grammar rules that govern this example are the following.

NP → Det Nominal
Nominal → Nominal PP
PP → P NP

Proceeding in a bottom-up fashion, the semantic attachment for this
kind of relational preposition should provide a two-place predicate with its
arguments distributed over two λ-expressions, as in the following.

P → on {λy λx On(x, y)}

With this kind of arrangement, the first argument to the predicate is provided
by the head of the prepositional phrase and the second is provided by the
constituent that the prepositional phrase is ultimately attached to. The
following semantic attachment provides the first part.

PP → P NP {P.sem(NP.sem)}

This λ-application results in a new λ-expression where the remaining
argument is the inner λ-variable.

This remaining argument can be incorporated using the following
nominal construction.

Nominal → Nominal PP {λz Nominal.sem(z) ∧ PP.sem(z)}
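The two-stage application can be followed in a short Python sketch for a
restaurant on Pearl. As before, this is one possible encoding rather than a
definitive one: curried Python functions stand in for the nested
λ-expressions, the function names are hypothetical, and "&" stands in for ∧.

```python
# P -> on {lambda y. lambda x. On(x, y)}
p_on = lambda y: (lambda x: f"On({x}, {y})")


def pp_attachment(p_sem, np_sem):
    """PP -> P NP {P.sem(NP.sem)}: binds y, leaving a lambda over x."""
    return p_sem(np_sem)


def nominal_pp_attachment(nominal_sem, pp_sem):
    """Nominal -> Nominal PP {lambda z. Nominal.sem(z) & PP.sem(z)}"""
    return lambda z: f"{nominal_sem(z)} & {pp_sem(z)}"


restaurant = lambda x: f"Isa({x}, Restaurant)"   # Nominal.sem for "restaurant"
on_pearl = pp_attachment(p_on, "Pearl")          # lambda x. On(x, Pearl)
restaurant_on_pearl = nominal_pp_attachment(restaurant, on_pearl)

print(on_pearl("x"))
print(restaurant_on_pearl("x"))
```

The first application consumes the outer variable y with Pearl; the second,
inside the nominal rule, feeds the shared variable z to both conjuncts.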
Verb Phrase Modifier Prepositional Phrases

The general approach to modifying verb phrases is similar to that of
modifying nominals. The differences lie in the details of the modification in
the verb phrase rule; the attachments for the preposition and prepositional
phrase rules are unchanged. Let's consider the phrase ate dinner in a hurry,
which is governed by the following verb phrase rule.

VP → VP PP

The meaning representation of the verb phrase constituent in this
construction, ate dinner, is a λ-expression where the λ-variable represents the
as yet unseen subject.

λx ∃e Isa(e, Eating) ∧ Eater(e, x) ∧ Eaten(e, Dinner)

The representation of the prepositional phrase is also a λ-expression
where the λ-variable is the second argument in the PP semantics.

λx In(x, <∃h Hurry(h)>)

The correct representation for the modified verb phrase should contain
the conjunction of these two representations with the Eating event variable
filling the first argument slot of the In expression. In addition, this modified
representation must remain a λ-expression with the unbound Eater variable
as the new λ-variable. The following attachment expression fulfills all of
these requirements.

VP → VP PP {λy VP.sem(y) ∧ PP.sem(VP.sem.variable)}

There are two aspects of this attachment that require some
elaboration. The first involves the application of the constituent verb phrase's
λ-expression to the variable y. Binding the lower λ-expression's variable to
a new variable allows us to lift the lower variable to the level of the newly
created λ-expression. The result of this technique is a new λ-expression with
a variable that, in effect, plays the same role as the original variable in the
lower expression. In this case, this allows a λ-expression to be modified
during the analysis process before the argument to the expression is actually
available.

The second new aspect in this attachment involves the VP.sem.variable
notation. This notation is used to access the event-variable representing the
underlying meaning of the verb phrase, in this case, e. This is analogous
to the notation used to provide access to the various parts of complex-terms
introduced earlier.

Applying this attachment to the current example yields the following
representation, which is suitable for combination with a subsequent subject
noun phrase.

λy ∃e Isa(e, Eating) ∧ Eater(e, y) ∧ Eaten(e, Dinner)
    ∧ In(e, <∃h Hurry(h)>)
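Both aspects of this attachment, the lifted subject variable and the access to
the event variable, can be seen in a Python sketch. It assumes a hypothetical
VPSem class that stores the event variable next to a function over the
subject, models formulas as plain strings, and writes "exists" and "&" for ∃
and ∧.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VPSem:
    """Verb phrase meaning: event variable plus a lambda over the unseen subject."""
    variable: str
    body: Callable[[str], str]

    def __call__(self, subject: str) -> str:
        return self.body(subject)


# VP.sem for "ate dinner".
ate_dinner = VPSem(
    "e", lambda x: f"exists e Isa(e, Eating) & Eater(e, {x}) & Eaten(e, Dinner)"
)

# PP.sem for "in a hurry": lambda x. In(x, <exists h Hurry(h)>)
in_a_hurry = lambda x: f"In({x}, <exists h Hurry(h)>)"


def vp_pp_attachment(vp_sem: VPSem, pp_sem: Callable[[str], str]) -> VPSem:
    """VP -> VP PP {lambda y. VP.sem(y) & PP.sem(VP.sem.variable)}"""
    return VPSem(
        vp_sem.variable,  # the modified VP keeps the same event variable
        lambda y: f"{vp_sem(y)} & {pp_sem(vp_sem.variable)}",
    )


ate_dinner_in_a_hurry = vp_pp_attachment(ate_dinner, in_a_hurry)
print(ate_dinner_in_a_hurry("y"))
```

Because the result is again a VPSem, it can be modified by further
prepositional phrases or applied to a subject when one becomes available.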

Verb  Argument  Prepositional  Phrases 

The prepositional phrases in this category serve to signal the role an argument
plays in some larger event structure. As such, the preposition itself does not
actually modify the meaning of the noun phrase. Consider the following
example of role-signaling prepositional phrases.

(15.14) I need to go from Boston to Dallas.

In examples like this, the arguments to go are expressed as prepositional
phrases. However, the meaning representations of these phrases should
consist solely of the unaltered representation of their head nouns. To handle
this, argument prepositional phrases are treated in the same way that
non-branching grammatical rules are; the semantic attachment of the noun phrase
is copied unchanged to the semantics of the larger phrase.

PP → P NP {NP.sem}

The verb phrase can then assign this meaning representation to the
appropriate event role. A more complete account of how these argument-bearing
prepositional phrases map to underlying event roles will be presented in
Chapter 16.

15.3  Integrating  Semantic  Analysis  into  the 
Earley  Parser 

In Section 15.1, we suggested a simple pipeline architecture for a semantic
analyzer where the results of a complete syntactic parse are passed to a
semantic analyzer. The motivation for this notion stems from the fact that the
compositional approach requires the syntactic parse before it can proceed. It
is, however, also possible to perform semantic analysis in parallel with
syntactic processing. This is possible because in our compositional framework,
the meaning representation for a constituent can be created as soon as all of
its constituent parts are present. This section describes just such an approach
to integrating semantic analysis into the Earley parser from Chapter 10.

The integration of semantic analysis into an Earley parser is
straightforward and follows precisely the same lines as the integration of
unification into the algorithm given in Chapter 11. Three modifications are
required to the original algorithm:

• The rules of the grammar are given a new field to contain their semantic
attachments.

• The states in the chart are given a new field to hold the meaning
representation of the constituent.

• The Enqueue function is altered so that when a complete state is
entered into the chart its semantics are computed and stored in the state’s
semantic field.


procedure ENQUEUE(state, chart-entry)
    if INCOMPLETE?(state) then
        if state is not already in chart-entry then
            PUSH(state, chart-entry)
    else if UNIFY-STATE(state) succeeds then
        if APPLY-SEMANTICS(state) succeeds then
            if state is not already in chart-entry then
                PUSH(state, chart-entry)

procedure APPLY-SEMANTICS(state)
    meaning-rep ← APPLY(state.semantic-attachment, state)
    if meaning-rep does not equal failure then
        state.meaning-rep ← meaning-rep


Figure 15.5 The Enqueue function modified to handle semantics. If
the state is complete and unification succeeds then Enqueue calls
Apply-Semantics to compute and store the meaning representation of completed
states.


Figure 15.5 shows the Enqueue and Apply-Semantics functions modified to create
meaning representations. When Enqueue is passed a complete state that
can successfully unify its unification constraints, it calls Apply-Semantics
to compute and store the meaning representation for this state. Note the
importance of performing feature-structure unification prior to semantic
analysis. This ensures that semantic analysis will be performed only on valid
trees and that features needed for semantic analysis will be present.
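
The control flow of Figure 15.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `State` fields, the `FAILURE` sentinel, and the always-succeeding `unify_state` stub are hypothetical names introduced here, and a semantic attachment is modeled as a function over the meanings of a state's completed children.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

FAILURE = object()  # assumed sentinel returned by an attachment that fails


@dataclass
class State:
    rule: str                      # e.g. "PP -> P NP"
    complete: bool                 # True when the dot has reached the end of the rule
    child_meanings: List[Any]      # meanings of the completed child constituents
    semantic_attachment: Optional[Callable[[List[Any]], Any]] = None
    meaning_rep: Any = None        # filled in by apply_semantics


def unify_state(state: State) -> bool:
    """Stand-in for feature-structure unification; always succeeds in this sketch."""
    return True


def apply_semantics(state: State) -> bool:
    """Compute and store the state's meaning representation; report success."""
    if state.semantic_attachment is None:
        return True
    meaning = state.semantic_attachment(state.child_meanings)
    if meaning is FAILURE:
        return False
    state.meaning_rep = meaning
    return True


def enqueue(state: State, chart_entry: List[State]) -> None:
    """Add a state to the chart; complete states must unify and get a meaning first."""
    if not state.complete:
        if state not in chart_entry:
            chart_entry.append(state)
    elif unify_state(state) and apply_semantics(state):
        if state not in chart_entry:
            chart_entry.append(state)


# PP -> P NP {NP.sem}: the noun phrase meaning is copied up unchanged.
chart: List[State] = []
enqueue(State("PP -> P NP", True, ["to", "Dallas"], lambda kids: kids[1]), chart)
# A state whose attachment fails never reaches the chart.
enqueue(State("X -> Y", True, [], lambda kids: FAILURE), chart)
```

Because unification is tested before the attachment is applied, the attachment in this sketch is only ever run on states that have passed the feature checks.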

The primary advantage of this integrated approach over the pipeline
approach lies in the fact that Apply-Semantics can fail in a manner similar
to the way that unification can fail. If a semantic ill-formedness is found in
the meaning representation being created, the corresponding state can be
blocked from entering the chart. In this way, semantic considerations can be
brought to bear during syntactic processing. Chapter 16 describes in some
detail the various ways that this notion of ill-formedness can be realized.

Unfortunately, this also illustrates one of the primary disadvantages of
integrating semantics directly into the parser: considerable effort may be
spent on the semantic analysis of orphan constituents that do not in the end
contribute to a successful parse. The question of whether the gains made by
bringing semantics to bear early in the process outweigh the costs involved
in performing extraneous semantic processing can only be answered on a
case-by-case basis.


15.4  Idioms  and  Compositionality 

Ce corps qui s’appelait et qui s’appelle encore le saint empire
romain n’était en aucune manière ni saint, ni romain, ni empire.

This  body,  which  called  itself  and  still  calls  itself  the  Holy  Roman 
Empire,  was  neither  Holy,  nor  Roman,  nor  an  Empire. 

-Voltaire3,  1756. 

As innocuous as it seems, the principle of compositionality runs into trouble
fairly quickly when real language is examined. There are many cases where
the meaning of a constituent is not based on the meaning of its parts, at least
not in the straightforward compositional sense. Consider the following WSJ
examples.

(15.15) Coupons are just the tip of the iceberg.

(15.16) The SEC’s allegations are only the tip of the iceberg.

(15.17) Coronary bypass surgery, hip replacement and intensive-care units
are but the tip of the iceberg.

The phrase the tip of the iceberg in each of these examples clearly doesn’t
have much to do with tips or icebergs. Instead, it roughly means something
like the beginning. The most straightforward way to handle idiomatic
constructions like these is to introduce new grammar rules specifically designed
to handle them. These idiomatic rules mix lexical items with grammatical
constituents, and introduce semantic content that is not derived from any of
their parts.

3 Essai sur les mœurs et l’esprit des nations. Translation by Y. Sills, as quoted in (Sills
and Merton, 1991).

Consider  the  following  rule  as  an  example  of  this  approach. 

NP → the tip of the iceberg
    {Beginning}

The lower-case items on the right-hand side of this rule are intended
to represent precisely those words in the input. Although the constant Beginning
should not be taken too seriously as a meaning representation for this idiom,
it does illustrate the idea that the meaning of this idiom is not based on
the meaning of any of its parts. Note that an Earley-style analyzer with
this rule will now produce two parses when this phrase is encountered: one
representing the idiom and one representing the compositional meaning.

Not surprisingly, as with the rest of the grammar, it may take a few tries
to get these rules right. Consider the following iceberg examples from the
WSJ corpus.

(15.18)  And  that’s  but  the  tip  of  Mrs.  Ford’s  iceberg. 

(15.19)  These  comments  describe  only  the  tip  of  a 1,000-page  iceberg. 

(15.20)  The  10  employees  represent  the  merest  tip  of  the  iceberg. 

The rule given above is clearly not general enough to handle these cases.
These examples indicate that there is a vestigial syntactic structure to this
phrase that permits some variation in the determiners used and also
permits some adjectival modification of both the iceberg and the tip. A more
promising rule would be something along the following lines.

NP → TipNP of IcebergNP
    {Beginning}

Here the categories TipNP and IcebergNP can be given an internal
nominal-like structure that permits some adjectival modification and some
variation in the determiners, while still restricting the heads of these noun
phrases to the lexical items tip and iceberg. Note that this syntactic solution
ignores the thorny issue that the modifiers mere and 1000-page seem to
indicate that both the tip and iceberg may in fact play some compositional role
in the meaning of the idiom. We will return to this topic in Chapter 16, when
we take up the issue of metaphor.
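
As a rough illustration of how restricted the internal structure of TipNP and IcebergNP can be, the sketch below approximates the rule with a regular expression. Everything in it is an assumption made for the example: the determiner list, the cap of two modifier words per noun phrase, and the function name `idiom_meaning`. A real grammar would state TipNP and IcebergNP as ordinary rules rather than as a regular expression.

```python
import re

# Hypothetical word classes, kept deliberately small.
DET = r"(?:the|a|an|this|that)"
# Up to two modifier words (determiners, possessives, adjectives), matched lazily.
MODS = r"(?:[\w,.'’\-]+\s+){0,2}?"

TIP_NP = rf"{DET}\s+{MODS}tip"      # e.g. "the tip", "the merest tip"
ICEBERG_NP = rf"{MODS}iceberg"      # e.g. "the iceberg", "a 1,000-page iceberg"

IDIOM = re.compile(rf"\b{TIP_NP}\s+of\s+{ICEBERG_NP}\b", re.IGNORECASE)


def idiom_meaning(sentence):
    """Return (meaning constant, matched span) for the idiom, or None."""
    match = IDIOM.search(sentence)
    if match is None:
        return None
    # The attachment {Beginning} is not built from the meanings of the parts.
    return ("Beginning", match.group(0))


for example in [
    "Coupons are just the tip of the iceberg.",
    "And that's but the tip of Mrs. Ford's iceberg.",
    "These comments describe only the tip of a 1,000-page iceberg.",
    "The 10 employees represent the merest tip of the iceberg.",
]:
    print(idiom_meaning(example))
```

Like the rule it imitates, the sketch assigns the same constant to every variant and therefore ignores whatever the modifiers contribute to the meaning.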

To summarize, handling idioms requires at least the following changes
to the general compositional framework.

• Allow the mixing of lexical items with traditional grammatical
constituents.

• Allow the creation of additional idiom-specific constituents needed to
handle the correct range of productivity of the idiom.

• Permit semantic attachments that introduce logical terms and
predicates that are not related to any of the constituents of the rule.

This discussion is obviously only the tip of an enormous iceberg.
Idioms are far more frequent and far more productive than is generally
recognized and pose serious difficulties for many applications, including, as we
will see in Chapter 21, machine translation.

15.5  Robust  Semantic  Analysis 

As we noted earlier, when syntax-driven semantic analysis is applied in
practice, certain compromises have to be made to facilitate system
development and efficiency of operation. The following sections describe the two
primary ways of instantiating a syntax-driven approach in practical systems.

Semantic  Grammars 

When we first introduced Frege’s principle of compositionality in Section
15.1, we noted that the parts referred to in that principle are the constituents
provided by a syntactic grammar. Unfortunately, the syntactic structures
provided by such grammars are often not particularly well-suited for the
task of compositional semantic analysis. This is not particularly
surprising since capturing elegant syntactic generalizations and avoiding
overgeneration carry considerably more weight in the design of grammars than
semantic sensibility does. This mismatch between the structures provided by
traditional grammars and those needed for compositional semantic analysis
typically manifests itself in the following three ways.

• Key semantic elements are often widely distributed across parse trees,
thus complicating the composition of the required meaning
representation.

• Parse trees often contain many syntactically motivated constituents that
play essentially no role in semantic processing.

• The general nature of many syntactic constituents results in semantic
attachments that create nearly vacuous meaning representations.

As an example of the first two problems, consider the parse tree shown
in Figure 15.6 for the following BERP example.

(15.21)  I want  to  go  to  eat  some  Italian  food  today. 

The branching structure of this tree distributes the key components of the
meaning representation widely throughout the tree. At the same time, most
of the nodes in the tree contribute almost nothing to the meaning of this
sentence. This structure requires three λ-expressions and a complex
term to bring the few contentful elements together at the top of the tree.

The third problem arises from the need to have uniform semantic
attachments in the compositional rule-to-rule approach. This requirement
often results in constituents that are at the right level of generality for the
syntax, but too high a level for semantic purposes. A good example of this is
the case of nominal compounds and adjective phrases, where the semantic
attachments are so general as to be nearly meaningless. Consider, for
example, the rule governing the phrase Italian food in our current example.

Nominal → Adj Nominal
    $\{\lambda x\; Nominal.sem(x) \land AM(x, Adj.sem)\}$

Applying this attachment results in the following meaning representation.

$$\lambda x\; Isa(x, Food) \land AM(x, Italian)$$

All nominals that fit this pattern receive the same vague interpretation that
roughly indicates that the nominal is modified by the adjective. This is a far
cry from what we know that expressions like Italian food and Italian restaurant
mean; they denote food prepared in a particular way, and restaurants that
serve food prepared that way. Unfortunately, there is no way to get this very
general rule to produce such an interpretation.
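
The vagueness of this attachment is easy to see when it is written out as a function. The sketch below is illustrative only: it represents a conjunction as a Python list of predicate tuples, and the helper names `noun_sem` and `adj_nominal_sem` are hypothetical.

```python
def noun_sem(category):
    """Meaning of a bare nominal such as 'food': lambda x. Isa(x, category)."""
    return lambda x: [("Isa", x, category)]


def adj_nominal_sem(adj_sem, nominal_sem):
    """Attachment for Nominal -> Adj Nominal:
    lambda x. Nominal.sem(x) AND AM(x, Adj.sem)."""
    return lambda x: nominal_sem(x) + [("AM", x, adj_sem)]


italian_food = adj_nominal_sem("Italian", noun_sem("Food"))
italian_restaurant = adj_nominal_sem("Italian", noun_sem("Restaurant"))

print(italian_food("x"))        # [('Isa', 'x', 'Food'), ('AM', 'x', 'Italian')]
print(italian_restaurant("x"))  # [('Isa', 'x', 'Restaurant'), ('AM', 'x', 'Italian')]
```

Both phrases receive the same uninformative AM predicate; nothing in the general rule can distinguish food prepared in a style from a restaurant that serves such food.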

These problems can be overcome through the use of semantic
grammars, which were originally developed for text-based dialog systems
in the domains of question-answering and intelligent tutoring (Brown and
Burton, 1975). Semantic grammars are grammars that are more directly oriented
towards serving the needs of a compositional analysis. In this approach, the rules
and constituents of the grammar are designed to correspond directly to
entities and relations from the domain being discussed. More specifically, such
grammars are constructed so that key semantic components can occur
together within single rules, and rules are made no more general than is needed
to achieve sensible semantic analyses.

Let’s  consider  how  these  two  general  strategies  might  be  applied  in  the 
BERP  domain.  Consider  the  following  candidate  rule  for  the  particular  kind 
of  information  request  illustrated  in  Example  15.21. 

InfoRequest → User want to go to eat FoodType TimeExpr

As with the rules introduced for idioms, rules of this type freely mix
non-terminals and terminals on their right-hand side. In this case, User, FoodType,
and TimeExpr represent semantically motivated non-terminal categories for
this domain. Given this, the semantic attachment for this rule would have all
the information that it needs to compose the meaning representation for
requests of this type from the immediate constituents of the rule. In particular,
there is no need for λ-expressions, since this flat rule elevates all the relevant
arguments to the top of the tree.

Now consider the following rule that could be used to parse the
phrase Italian food in our example.

FoodType → Nationality FoodType

The specific nature of this rule permits a far more useful semantic attachment
than is possible with the generic nominal rule given above. More specifically,
it can create a representation that states that the food specified by the
constituent FoodType is to be prepared in the style associated with the Nationality
constituent.
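
The two semantic-grammar rules just discussed can be sketched as a small hand-written recognizer. This is a minimal illustration, not the BERP grammar: the mini-lexicons, the slot names in the returned dictionaries, and the extra alternative FoodType → some FoodType (added so that the sketch covers Example 15.21) are all assumptions.

```python
# Hypothetical mini-lexicons for the semantically motivated categories.
USERS = {"i": "Speaker"}
NATIONALITIES = {"italian": "Italian", "chinese": "Chinese", "mexican": "Mexican"}
TIME_EXPRS = {"today": "Today", "tonight": "Tonight", "tomorrow": "Tomorrow"}


def parse_food_type(tokens, i):
    """FoodType -> Nationality FoodType | some FoodType | food

    Returns (meaning, next_index), or None if no FoodType starts at position i.
    """
    if i >= len(tokens):
        return None
    word = tokens[i]
    if word == "food":
        return {"isa": "Food"}, i + 1
    if word == "some":  # assumed extra alternative, see the lead-in
        return parse_food_type(tokens, i + 1)
    if word in NATIONALITIES:
        rest = parse_food_type(tokens, i + 1)
        if rest is None:
            return None
        meaning, j = rest
        # The specific rule licenses a specific attachment: a preparation style.
        return {**meaning, "prepared_in_style": NATIONALITIES[word]}, j
    return None


def parse_info_request(sentence):
    """InfoRequest -> User want to go to eat FoodType TimeExpr"""
    tokens = sentence.lower().rstrip(".?!").split()
    if len(tokens) < 8 or tokens[0] not in USERS:
        return None
    if tokens[1:6] != ["want", "to", "go", "to", "eat"]:
        return None
    food = parse_food_type(tokens, 6)
    if food is None:
        return None
    food_meaning, i = food
    # Exactly one TimeExpr token must remain after the FoodType.
    if i != len(tokens) - 1 or tokens[i] not in TIME_EXPRS:
        return None
    # Every argument is an immediate constituent, so no lambda-expressions are needed.
    return {
        "event": "Eating",
        "eater": USERS[tokens[0]],
        "eaten": food_meaning,
        "time": TIME_EXPRS[tokens[i]],
    }


print(parse_info_request("I want to go to eat some Italian food today."))
```

Because the rule is flat, the attachment reads its arguments directly off the right-hand side, at the cost of covering only this one way of phrasing the request.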

One of the key motivations for the use of semantic grammars in these
domains was the need to deal with various kinds of anaphor and ellipsis.
Semantic grammars can help with these phenomena since by their nature they
enable a certain amount of prediction. More specifically, they allow parsers
to make highly specific predictions about upcoming input, based on the
categories being actively predicted by the parser. Given this ability, anaphoric
references and missing elements can be associated with specific semantic
categories.

As an example of how this works, consider the following ATIS
examples.

(15.22)  When  does  flight  573  arrive  in  Atlanta? 

(15.23)  When  does  it  arrive  in  Dallas? 

Sentences  like  these  can  be  analyzed  with  a rule  like  the  following,  which 
makes  use  of  the  domain  specific  non-terminals  Flight  and  City. 

InfoRequest → when does Flight arrive in City

A rule such as this gives far more information about the likely referent
of it than a purely syntactic rule that would simply restrict it to anything
expressible as a noun phrase. Operationally, such a system might search
back in the dialog for places where the Flight constituent has been recently
used to find candidate references for this pronoun. Chapter 18 discusses the
topic of anaphor resolution in more detail.
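
The backward search just described might look like the following sketch. The representation of the dialog history as a list of utterances, each a list of (category, value) pairs, and the function name `resolve_pronoun` are assumptions made for illustration.

```python
def resolve_pronoun(expected_category, dialog_history):
    """Return the most recently mentioned constituent of the expected category.

    dialog_history is a list of utterances, oldest first; each utterance is a
    list of (category, value) pairs in the order they were recognized.
    """
    for utterance in reversed(dialog_history):
        for category, value in reversed(utterance):
            if category == expected_category:
                return value
    return None


# "When does flight 573 arrive in Atlanta?" has already been analyzed.
history = [[("Flight", "flight 573"), ("City", "Atlanta")]]

# In "When does it arrive in Dallas?" the rule predicts a Flight where "it" occurs.
print(resolve_pronoun("Flight", history))  # flight 573
```

The semantic category narrows the search to flights; a purely syntactic rule would have to consider every earlier noun phrase, including Atlanta.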

Not surprisingly, there are a number of drawbacks to basing a system
on a semantic grammar. The primary drawback arises from an almost
complete lack of reuse in the approach. Combining the syntax and semantics of
a domain into a single representation makes the resulting grammar specific
to that domain. In contrast, systems that keep their syntax and semantics
separate can, in principle, reuse their grammars in new domains. A second
lack of reuse arises as a consequence of eschewing syntactic generalizations
in the grammar. This results in an unavoidable growth in the size of the
grammar for a single domain. As an example of this, consider that whereas
our original noun phrase rule was sufficient to cover both Italian restaurant
and Italian food, we now need two separate rules for these phrases.
In fact, inspection of the BERP corpus reveals that we would also
need additional rules for vegetarian restaurant, California restaurant, and
expensive restaurant.

We should also note that semantic grammars are susceptible to a kind
of semantic overgeneration. As an example of this, consider the phrase
Canadian restaurant. It matches the rule given above for ethnic restaurants,
and would result in a meaning representation that specifies a restaurant that
serves food prepared in the Canadian style. Unfortunately, this is almost
certainly an incorrect interpretation of this phrase; none of the occurrences
of this phrase in the WSJ corpus had this meaning, all referring instead to
restaurants located within Canada. Dialog systems that use semantic
grammars rely on the rarity of such uses in restricted domains.

Finally,  we  should  note  that  semantic  grammars  probably  should  have 
been  called  something  else,  since  in  practice  the  grammars  themselves  are 
formally  the  same  as  any  other  grammar  formalism  we  have  discussed  in 
this  book.  Correspondingly,  there  are  no  special  algorithms  for  syntactic 
or  semantic  analysis  specific  to  semantic  grammars;  they  can  use  whatever 
algorithms  are  appropriate  for  the  grammar  formalism  being  employed,  such 
as  Earley,  or  any  other  context-free  parsing  algorithm. 

Information  Extraction 

In language processing tasks such as question-answering, coming to a
reasonable understanding of each input sentence is vital since giving a user a wrong
answer can have serious consequences. For these tasks, the rule-to-rule
approach with an eye towards semantics is a good way to build a complete
interpretation of an input sentence.

However, other tasks, like extracting information about joint ventures
from business news, understanding weather reports, or summarizing simple
information about what happened today on the stock market from a radio
report, do not necessarily require this kind of detailed understanding. Such
information extraction tasks are characterized by two properties: (1) the
desired knowledge can be described by a relatively simple and fixed
template, or frame, with slots that need to be filled in with material from the
text, and (2) only a small part of the information in the text is relevant for
filling in this frame; the rest can be ignored.

For example, one of the tasks used in the fifth Message Understanding
Conference (MUC-5) in 1993 (Sundheim, 1993), a U.S.
Government-organized information extraction conference, was to extract information about
international joint ventures from business news. Here are the first two
sentences of a sample article from (Grishman and Sundheim, 1995):

Bridgestone  Sports  Co.  said  Friday  it  has  set  up  a joint  venture  in  Tai- 




Methodology Box: Evaluating Information Extraction Systems


The  information  extraction  paradigm  has  much  in  common  with 
the  field  of  information  retrieval  and  has  adapted  several  standard 
evaluation  metrics  from  information  retrieval  including  precision, 
recall,  fallout,  and  a combined  metric  called  an  F-measure. 

Recall is a measure of how much relevant information the
system has extracted from the text; it is thus a measure of the coverage
of the system. Recall is defined as follows:

$$\text{Recall} = \frac{\text{\# of correct answers given by system}}{\text{total \# of possible correct answers in the text}}$$


Precision is a measure of how much of the information that the
system returned is actually correct, and is also known as accuracy.
Precision is defined as follows:

$$\text{Precision} = \frac{\text{\# of correct answers given by system}}{\text{\# of answers given by system}}$$

Fallout is a measure of the system’s ability to ignore spurious
information in the text. It is defined as follows:

$$\text{Fallout} = \frac{\text{\# of incorrect answers given by system}}{\text{\# of spurious facts in the text}}$$


Note that recall and precision are antagonistic to one another
since a conservative system that strives for perfection in terms of
precision will invariably lower its recall score. Similarly, a system
that strives for coverage will get more things wrong, thus lowering
its precision score. This situation has led to the use of a combined
measure called the F-measure that balances recall and precision by
using a parameter β. With P for precision and R for recall, the F-measure
is defined as follows:

$$F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$$

When β is one, precision and recall are given equal weight. Under this
definition F approaches R as β grows and approaches P as β shrinks toward
zero, so when β is greater than one, recall is favored, and when β is less
than one, precision is favored.
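
The measures in this box are straightforward to compute once the counts are known. The sketch below is a minimal illustration; the function names and the example counts are made up, and the functions assume non-zero denominators. Evaluating the F-measure at a few values of β shows which of the two scores the combined value moves toward.

```python
def recall(correct, possible):
    """# of correct answers given by system / total # of possible correct answers."""
    return correct / possible


def precision(correct, given):
    """# of correct answers given by system / # of answers given by system."""
    return correct / given


def f_measure(p, r, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


# Made-up counts: the system gives 8 answers, 6 of them correct,
# and the text contains 12 possible correct answers.
p = precision(correct=6, given=8)      # 0.75
r = recall(correct=6, possible=12)     # 0.5

for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_measure(p, r, beta), 3))
```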
TIE-UP-1:
    Relationship:            TIE-UP
    Entities:                “Bridgestone Sports Co.”
                             “a local concern”
                             “a Japanese trading house”
    Joint Venture Company:   “Bridgestone Sports Taiwan Co.”
    Activity:                ACTIVITY-1
    Amount:                  NT$20000000

ACTIVITY-1:
    Company:                 “Bridgestone Sports Taiwan Co.”
    Product:                 “iron and “metal wood” clubs”
    Start Date:              DURING: January 1990

Figure 15.7 The templates produced by the FASTUS (Hobbs et al., 1997)
information extraction engine given the input text on page 575.

wan  with  a local  concern  and  a Japanese  trading  house  to  produce  golf 
clubs  to  be  shipped  to  Japan. 

The  joint  venture,  Bridgestone  Sports  Taiwan  Co.,  capitalized  at  20 
million  new  Taiwan  dollars,  will  start  production  in  January  1990  with 
production  of  20,000  iron  and  “metal  wood”  clubs  a month. 

The output of an information extraction system can be a single template
with a certain number of slots filled in, or a more complex hierarchically
related set of objects. The MUC-5 task specified this latter more complex
output, requiring systems to produce hierarchically linked templates describing
the participants in the joint venture, the resulting company, and its intended
activity, ownership and capitalization. Figure 15.7 shows the resulting
structure produced by the FASTUS system (Hobbs et al., 1997).

Many information extraction systems are built around cascades of
finite-state automata. The FASTUS system, for example, produces the template
given above, based on a cascade in which each level of linguistic
processing extracts some information from the text, which is passed on to the next
higher level, as shown in Figure 15.8.

Many systems base all or most of these levels on finite automata,
although in practice most complete systems are not technically finite-state,
either because the individual automata are augmented with feature registers
(as in FASTUS), or because they are used only as preprocessing steps for full
parsers (e.g., Gaizauskas et al., 1995; Weischedel, 1995), or are combined
with other components based on decision trees (Fisher et al., 1995).

No.  Step                Description
1    Tokens:             Transfer an input stream of characters into a token
                         sequence.
2    Complex Words:      Recognize multi-word phrases, numbers, and proper
                         names.
3    Basic phrases:      Segment sentences into noun groups, verb groups,
                         and particles.
4    Complex phrases:    Identify complex noun groups and complex verb
                         groups.
5    Semantic Patterns:  Identify semantic entities and events and insert
                         into templates.
6    Merging:            Merge references to the same entity or event from
                         different parts of the text.

Figure 15.8 Levels of processing in FASTUS (Hobbs et al., 1997). Each
level extracts a specific type of information which is then passed on to the next
higher level.

Let’s sketch the FASTUS implementation of each of these levels,
following Hobbs et al. (1997) and Appelt et al. (1995). After tokenization, the
second level recognizes multiwords like set up and joint venture, and names
like Bridgestone Sports Co. The name recognizer is a transducer, composed
of a large set of specific mappings designed to handle locations, personal
names, and names of organizations, companies, unions, performing groups,
etc. The following are typical rules for modeling names of performing
organizations like San Francisco Symphony Orchestra and Canadian Opera
Company. While the rules are written using a context-free syntax, there is
no recursion and therefore they can be automatically compiled into
finite-state transducers:


Performer-Org    →  (pre-location) Performer-Noun+ Perf-Org-Suffix
pre-location     →  locname | nationality
locname          →  city | region
Perf-Org-Suffix  →  orchestra, company
Performer-Noun   →  symphony, opera
nationality      →  Canadian, American, Mexican
city             →  San Francisco, London
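
Because these rules contain no recursion, each non-terminal can be expanded in place until only words remain, which yields an ordinary regular expression. The sketch below does this by hand as an illustration; it is a recognizer rather than a transducer, the helper names are hypothetical, and the region alternative is omitted because the rules above list no region entries.

```python
import re


def alternatives(*phrases):
    """Build a non-capturing group that matches any one of the given phrases."""
    return "(?:" + "|".join(re.escape(p) for p in phrases) + ")"


CITY = alternatives("San Francisco", "London")
NATIONALITY = alternatives("Canadian", "American", "Mexican")
LOCNAME = CITY  # locname -> city | region, with the region alternative left out
PRE_LOCATION = f"(?:{LOCNAME}|{NATIONALITY})"
PERFORMER_NOUN = alternatives("symphony", "opera")
PERF_ORG_SUFFIX = alternatives("orchestra", "company")

# Performer-Org -> (pre-location) Performer-Noun+ Perf-Org-Suffix
PERFORMER_ORG = re.compile(
    rf"\b(?:{PRE_LOCATION}\s+)?(?:{PERFORMER_NOUN}\s+)+{PERF_ORG_SUFFIX}\b",
    re.IGNORECASE,
)


def find_performer_orgs(text):
    """Return every substring of text that matches the Performer-Org pattern."""
    return [match.group(0) for match in PERFORMER_ORG.finditer(text)]


print(find_performer_orgs(
    "The San Francisco Symphony Orchestra and the Canadian Opera Company toured."
))
```

The parentheses around pre-location in the rule become an optional group, and the + on Performer-Noun becomes a one-or-more repetition.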


The second stage also might transduce sequences like forty two into
the appropriate numeric value (recall the discussion of this problem on page
124 in Chapter 5).

The third FASTUS stage produces a series of basic phrases, such as
noun groups, verb groups, etc., using finite-state rules of the sort shown on
page 386. The output of the FASTUS basic phrase identifier is shown in
Figure 15.9; note the use of some domain-specific basic phrases like Company
and Location.


Company       Bridgestone Sports Co.
Verb Group    said
Noun Group    Friday
Noun Group    it
Verb Group    had set up
Noun Group    a joint venture
Preposition   in
Location      Taiwan
Preposition   with
Noun Group    a local concern
Conjunction   and
Noun Group    a Japanese trading house
Verb Group    to produce
Noun Group    golf clubs
Verb Group    to be shipped
Preposition   to
Location      Japan

Figure 15.9 The output of Stage 3 of the FASTUS basic-phrase extractor,
which uses finite-state rules of the sort described by Appelt and Israel (1997)
and shown on page 386.


Recall that Chapter 10 described how these basic phrases can be
combined into complex noun groups and verb groups. This is accomplished in
Stage 4 of FASTUS, by dealing with conjunction and with the attachment of
measure phrases as in the following.

20,000 iron and “metal wood” clubs a month,

and prepositional phrases:

production of 20,000 iron and “metal wood” clubs a month.

The output of Stage 4 is a list of complex noun groups and verb groups.
Stage 5 takes this list, ignoring all input that has not been chunked into a
complex group, recognizes entities and events in the complex groups, and
inserts the recognized objects into the proper templates. The recognition of
entities and events is done by hand-coded finite-state automata whose
transitions are based on particular complex-phrase types annotated by particular
head words or particular features like company, currency, or date.

(1)  Relationship:            TIE-UP
     Entities:                “Bridgestone Sports Co.”
                              “a local concern”
                              “a Japanese trading house”

(2)  Activity:                PRODUCTION
     Product:                 “golf clubs”

(3)  Relationship:            TIE-UP
     Joint Venture Company:   “Bridgestone Sports Taiwan Co.”
     Amount:                  NT$20000000

(4)  Activity:                PRODUCTION
     Company:                 “Bridgestone Sports Taiwan Co.”
     Start Date:              DURING: January 1990

(5)  Activity:                PRODUCTION
     Product:                 “iron and “metal wood” clubs”

Figure 15.10 The five partial templates produced by Stage 5 of the FASTUS
system. These templates will be merged by the Stage 6 Merging algorithm to
produce the final template shown in Figure 15.7 on page 577.

For  example,  the  first  sentence  of  the  news  story  above  realizes  the 
semantic  patterns  based  on  the  following  two  regular  expressions  (where 
NG  indicates  Noun-Group  and  VG  Verb-Group). 

• NG(Company/ies) VG(Set-up) NG(Joint-Venture) with NG(Company/ies)

• VG(Produce) NG(Product)

The second sentence realizes the second pattern above as well as the
following two patterns:

• NG(Company) VG-Passive(Capitalized) at NG(Currency)

• NG(Company) VG(Start) NG(Activity) in/on NG(Date)
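
One way to picture how such a pattern is applied is as a search for a matching run in the list of chunks. The sketch below illustrates this for the capitalization pattern only. The `Chunk` type, the labels assigned to each chunk, the treatment of the literal word at as a chunk of kind "P", and the slot names are all assumptions; FASTUS itself uses hand-coded automata, and this sketch does not normalize the amount.

```python
from typing import List, NamedTuple, Optional, Sequence, Tuple


class Chunk(NamedTuple):
    kind: str      # "NG", "VG", "VG-Passive", or "P" for a literal preposition
    feature: str   # head-word class or semantic feature, e.g. "Company"
    text: str      # the words covered by the chunk


Pattern = Sequence[Tuple[str, str]]


def match_pattern(pattern: Pattern, chunks: Sequence[Chunk]) -> Optional[List[Chunk]]:
    """Return the first contiguous run of chunks whose (kind, feature) pairs
    equal the pattern, or None if there is no such run."""
    n = len(pattern)
    for start in range(len(chunks) - n + 1):
        window = chunks[start:start + n]
        if all((c.kind, c.feature) == p for c, p in zip(window, pattern)):
            return list(window)
    return None


# NG(Company) VG-Passive(Capitalized) at NG(Currency)
CAPITALIZED_AT: Pattern = [
    ("NG", "Company"), ("VG-Passive", "Capitalized"), ("P", "at"), ("NG", "Currency"),
]


def capitalization_template(chunks: Sequence[Chunk]) -> Optional[dict]:
    """Fill a partial TIE-UP template from a match of the capitalization pattern."""
    hit = match_pattern(CAPITALIZED_AT, chunks)
    if hit is None:
        return None
    company, _verb, _prep, amount = hit
    return {
        "Relationship": "TIE-UP",
        "Joint Venture Company": company.text,
        "Amount": amount.text,
    }


# Hypothetical chunking of the start of the second sentence of the article.
second_sentence = [
    Chunk("NG", "Joint-Venture", "The joint venture"),
    Chunk("NG", "Company", "Bridgestone Sports Taiwan Co."),
    Chunk("VG-Passive", "Capitalized", "capitalized"),
    Chunk("P", "at", "at"),
    Chunk("NG", "Currency", "20 million new Taiwan dollars"),
    Chunk("VG", "Start", "will start"),
    Chunk("NG", "Activity", "production"),
]

print(capitalization_template(second_sentence))
```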

The result of processing these two sentences is the set of five draft
templates shown in Figure 15.10. These five templates must then be merged into
the single hierarchical structure shown in Figure 15.7. The merging
algorithm decides whether two activity or relationship structures are sufficiently
consistent that they might be describing the same events, and merges them
if so. Since the merging algorithm must perform reference resolution
(deciding when it is the case that two descriptions refer to the same entity), we
defer description of this level to Chapter 18.
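
As a very rough picture of the consistency test, the sketch below merges two partial templates only when no slot they share holds different values. This is an assumed simplification for illustration, not the FASTUS merging algorithm: it compares slot values by plain string equality and performs no reference resolution.

```python
def merge_templates(a, b):
    """Merge two partial templates, or return None if a shared slot conflicts."""
    for slot, value in a.items():
        if slot in b and b[slot] != value:
            return None
    return {**a, **b}


template_2 = {"Activity": "PRODUCTION", "Product": "golf clubs"}
template_4 = {
    "Activity": "PRODUCTION",
    "Company": "Bridgestone Sports Taiwan Co.",
    "Start Date": "DURING: January 1990",
}
template_5 = {"Activity": "PRODUCTION", "Product": 'iron and "metal wood" clubs'}

# Templates (4) and (5) share only the Activity slot, so they merge.
print(merge_templates(template_4, template_5))

# Templates (2) and (5) name the product differently, so string equality
# rejects the merge.
print(merge_templates(template_2, template_5))  # None
```

The second call shows why reference resolution is needed: deciding that golf clubs and iron and “metal wood” clubs describe the same product takes more than comparing strings.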

Domain-specific templates of the kind we have described in this
section have also been used in many limited-domain semantic understanding
and discourse comprehension tasks, including managing mixed dialog in
question-answering systems (Bobrow et al., 1977).


15.6  Summary 

This chapter explores the notion of syntax-driven semantic analysis. Among
the highlights of this chapter are the following topics.

• Semantic analysis is the process whereby meaning representations are
created and assigned to linguistic inputs.

• Semantic analyzers that make use of static knowledge from the lexicon
and grammar can create context-independent literal, or conventional,
meanings.

• The  Principle  of  Compositionality  states  that  the  meaning  of  a sentence 
can  be  composed  from  the  meanings  of  its  parts. 

• In syntax-driven semantic analysis, the parts are the syntactic
constituents of an input.

• Compositional creation of FOPC formulas is possible with a few
notational extensions including λ-expressions and complex terms.

• Natural  language  quantifiers  introduce  a kind  of  ambiguity  that  is  dif- 
ficult to  handle  compositionally.  Complex  terms  can  be  used  to  com- 
pactly encode  this  ambiguity. 

• Idiomatic  language  defies  the  principle  of  compositionality  but  can  eas- 
ily be  handled  by  adapting  the  techniques  used  to  design  grammar  rules 
and  their  semantic  attachments. 

• Practical  semantic  analysis  systems  adapt  the  strictly  compositional 
approach  in  a number  of  ways. 

- Dialog  systems  based  on  semantic  grammars  rely  on  grammars 
that  have  been  written  to  serve  the  needs  of  semantics  rather  than 
syntactic  generality. 

- Information  extraction  systems  based  on  cascaded  automata  can 
extract  pertinent  information  while  ignoring  irrelevant  parts  of  the 
input. 
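
The compositional idea summarized above can be sketched with ordinary Python
lambdas standing in for λ-expressions. The toy lexicon, the variable names, and
the flat Serves(subject, object) string are assumptions made for brevity; the
meaning representations used in the chapter itself are richer than this.

```python
# Meanings of the parts (a hypothetical toy lexicon).
np_subject = "Maharani"
np_object = "Meat"

# A transitive verb as a two-argument lambda: object first, then subject.
verb = lambda obj: lambda subj: f"Serves({subj}, {obj})"

# VP -> Verb NP : apply the verb's meaning to the object's meaning.
vp = verb(np_object)

# S -> NP VP : apply the VP's meaning to the subject's meaning.
sentence = vp(np_subject)
```

Each rule's attachment is just a function application over the meanings of its
constituents, which is the sense in which the sentence meaning is composed from
the meanings of its parts.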

Bibliographical and Historical Notes


As  noted  earlier,  the  principle  of  compositionality  is  traditionally  attributed 
to  Frege;  Janssen  (1997)  discusses  this  attribution.  Using  the  categorial 
grammar  framework  described  in  Chapter  12,  Montague  (1973)  demonstrated 
that a compositional approach could be systematically applied to an
interesting fragment of natural language. The rule-to-rule hypothesis was first
articulated by Bach (1976). On the computational side of things, Woods’s
Lunar  system  (Woods,  1977)  was  based  on  a pipelined  syntax-first  com- 
positional analysis.  Schubert  and  Pelletier  (1982)  developed  an  incremental 
rule-to-rule  system  based  on  Gazdar’s  GPSG  approach  (Gazdar,  1981,  1982; 
Gazdar  et  al.,  1985).  Main  and  Benson  (1983)  extended  Montague’s  ap- 
proach to  the  domain  of  question-answering. 

In  one  of  the  all  too  frequent  cases  of  parallel  development,  researchers 
in  programming  languages  developed  essentially  identical  compositional  tech- 
niques to  aid  in  the  design  of  compilers.  Specifically,  Knuth  (1968)  intro- 
duced the  notion  of  attribute  grammars  that  associate  semantic  structures 
with syntactic structures in a one-to-one correspondence. As a consequence,
the style of semantic attachments used in this chapter will be familiar to users
of the YACC-style (Johnson and Lesk, 1978) compiler tools.

Semantic  Grammars  are  due  to  Burton  (Brown  and  Burton,  1975). 
Similar notions developed around the same time included Pragmatic
Grammars (Woods, 1977), and Performance Grammars (Robinson, 1975). All
centered  around  the  notion  of  reshaping  syntactic  grammars  to  serve  the 
needs  of  semantic  processing.  It  is  safe  to  say  that  most  modern  systems 
developed  for  use  in  limited  domains  make  use  of  some  form  of  semantic 
grammar. 

Most  of  the  techniques  used  in  the  fragment  of  English  presented  in 
Section  15.2  are  adapted  from  SRI’s  Core  Language  Engine  (Alshawi,  1992). 
Additional bits and pieces were adapted from (Woods, 1977; Schubert and
Pelletier, 1982; Gazdar et al., 1985). Of necessity, a large number of
important topics were not covered in this chapter. See (Alshawi, 1992) for
the standard gap-threading approach to semantic interpretation in the
presence of long-distance dependencies. ter Meulen (1995) presents an
up-to-date treatment of tense, aspect, and the representation of temporal
information. Extensive coverage of approaches to quantifier scoping can be found
in (Hobbs and Shieber, 1987; Alshawi, 1992). van Lehn (1978) presents a
set of human preferences for quantifier scoping. Over the years, a considerable
amount of effort has been directed toward the interpretation of nominal
compounds. Linguistic research on this topic can be found in (Lees, 1970;
Downing, 1977; Levi, 1978; Ryder, 1994); more computational approaches
are described in (Gershman, 1977; Finin, 1980; McDonald, 1982; Pierre,
1984; Arens et al., 1987; Wu, 1992; Vanderwende, 1994; Lauer, 1995).

There is a long and extensive literature on idioms. Fillmore et al.
(1988) describe a general grammatical framework that places idioms at the
center of its underlying theory. Makkai (1972) presents an extensive
linguistic analysis of many English idioms. Hundreds of idiom dictionaries
for second language learners are also available. On the computational
side, Becker (1975) was among the first to suggest the use of phrasal rules
in parsers. Wilensky and Arens (1980) were among the first to successfully
make use of this notion. Zernik (1987) demonstrated a system that could
learn such phrasal idioms in context. A collection of papers on
computational approaches to idioms appeared in (Fass et al., 1992).

The  first  work  on  information  extraction  was  performed  in  the  context 
of the Frump system (DeJong, 1982). Later work was stimulated by the
U.S. government-sponsored MUC conferences (Sundheim, 1991, 1992, 1993,
1995b). Chinchor et al. (1993) describe the evaluation techniques used in
the MUC-3 and MUC-4 conferences. Hobbs (1997) partially credits the
inspiration for FASTUS to the success of the University of Massachusetts
CIRCUS system (Lehnert et al., 1991) in MUC-3. The SCISOR system is
another  system  based  loosely  on  cascades  and  semantic  expectations  that 
did  well  in  MUC-3  (Jacobs  and  Rau,  1990).  Due  to  the  lack  of  reuse  from 
one  domain  to  another  in  information  extraction,  a considerable  amount  of 
work  has  focused  on  automating  the  process  of  knowledge  acquisition  in  this 
area. A variety of supervised learning approaches are described in (Cardie,
1993, 1994; Riloff, 1993; Soderland et al., 1995; Huffman, 1996; Freitag,
1998). 

Finally,  we  have  skipped  an  entire  branch  of  semantic  analysis  in  which 
expectations  driven  from  deep  meaning  representations  drive  the  analysis 
process.  Such  systems  avoid  the  direct  representation  and  use  of  syntax, 
rarely  making  use  of  anything  resembling  a parse  tree.  The  earliest  and  most 
successful efforts along these lines were developed by Simmons (1973b,
1978, 1983) and Wilks (1975a, 1975c). A series of similar approaches were
developed by Roger Schank and his students (Riesbeck, 1975; Birnbaum and
Selfridge, 1981; Riesbeck, 1986). In these approaches, the semantic analysis
process is guided by detailed procedures associated with individual lexical
items. The CIRCUS information extraction system (Lehnert et al., 1991)
traces its roots to these systems.


Exercises 

15.1  The  attachment  given  on  page  560  to  handle  noun  phrases  with  com- 
plex determiners  is  not  general  enough  to  handle  most  possessive  noun  phrases. 
Specifically,  it  doesn’t  work  for  phrases  like  the  following. 

a.  My  sister’s  flight 

b.  My  fiance’s  mother’s  flight 

Create  a new  set  of  semantic  attachments  to  handle  cases  like  these. 

15.2  Develop  a set  of  grammar  rules  and  semantic  attachments  to  handle 
predicate  adjectives  such  as  the  one  following. 

a.  Flight  308  from  New  York  is  expensive. 

b.  Murphy’s  restaurant  is  cheap. 

15.3  None  of  the  attachments  given  in  this  chapter  provide  temporal  infor- 
mation. Augment  a small  number  of  the  most  basic  rules  to  add  temporal 
information  along  the  lines  sketched  in  Chapter  14.  Use  your  rules  to  create 
meaning  representations  for  the  following  examples. 

a.  Flight  299  departed  at  9 o’clock. 

b.  Flight  208  will  arrive  at  3 o’clock. 

c.  Flight  1405  will  arrive  late. 

15.4  As  noted  in  Chapter  14,  the  present  tense  in  English  can  be  used  to 
refer  to  either  the  present  or  the  future.  However,  it  can  also  be  used  to 
express  habitual  behavior,  as  in  the  following. 

Flight  208  leaves  at  3 o’clock. 

This could be a simple statement about today’s Flight 208, or
alternatively it might state that this flight leaves at 3 o’clock every day. Create a
FOPC meaning representation along with appropriate semantic attachments
for this habitual sense.

15.5  Implement  the  Earley-based  semantic  analyzer  described  in  Section 
15.3. 

15.6  It  has  been  claimed  that  it  is  not  necessary  to  explicitly  list  the  seman- 
tic attachment  for  most  grammar  rules.  Instead,  the  semantic  attachment  for 
a rule  should  be  inferable  from  the  semantic  types  of  the  rule’s  constituents. 
For example, if a rule has two constituents where one is a single-argument
λ-expression and the other is a constant then the semantic attachment should
obviously apply the λ-expression to the constant. Given the attachments
presented in this chapter, does this type-driven semantics seem like a reasonable
idea?

15.7 Add a simple type-driven semantics mechanism to the Earley analyzer
you implemented for Exercise 15.5.

15.8 Using a phrasal search on your favorite Web search engine, collect a
small corpus of the tip of the iceberg examples. Be certain that you search
for an appropriate range of examples (i.e., don’t just search for “the tip of the
iceberg”). Analyze these examples and come up with a set of grammar rules
that correctly accounts for them.

15.9  Collect  a similar  corpus  of  examples  for  the  idiom  miss  the  boat.  An- 
alyze these  examples  and  come  up  with  a set  of  grammar  rules  that  correctly 
accounts  for  them. 

15.10 There are now a fair number of Web-based natural language question
answering services that purport to provide answers to questions on a wide
range of topics (see this book’s Web page for pointers to current services).
Develop  a corpus  of  questions  for  some  general  domain  of  interest  and  use 
it  to  evaluate  one  or  more  of  these  services.  Report  your  results.  What 
difficulties  did  you  encounter  in  applying  the  standard  evaluation  techniques 
to  this  task? 

15.11  Collect  a small  corpus  of  weather  reports  from  your  local  newspaper 
or  the  Web.  Based  on  an  analysis  of  this  corpus,  create  a set  of  frames 
sufficient  to  capture  the  semantic  content  of  these  reports. 

15.12  Implement  and  evaluate  a small  information  extraction  system  for 
the  weather  report  corpus  you  collected  for  the  last  exercise. 

‘When I use a word,’ Humpty Dumpty said in rather a scornful
tone, ‘it means just what I choose it to mean - neither more nor
less.’

Lewis Carroll’s Alice in Wonderland

How many legs does a dog have if you call its tail a leg?
Four.
Calling a tail a leg doesn’t make it one.

Attributed to Abraham Lincoln


A revised  version  of  this  chapter  will  be  available  shortly. 

The  previous  two  chapters  focused  on  representing  and  creating  mean- 
ing representations  for  entire  sentences.  In  those  discussions,  we  made  min- 
imal use  of  the  notion  of  the  meaning  of  a word.  Words  and  their  meanings 
were  of  interest  solely  to  the  extent  that  they  provided  the  appropriate  bits 
and  pieces  necessary  to  construct  adequate  meaning  representations  for  en- 
tire sentences.  This  general  approach  is  motivated  by  the  view  that  while 
words  may  contribute  content  to  the  meanings  of  sentences,  they  do  not 
themselves  have  meanings.  By  this  we  mean  that  words,  by  themselves, 
do not refer to the world, cannot be judged to be true or false, or literal
or figurative, or a host of other things that are generally reserved to entire
sentences  and  utterances.  This  narrow  conception  of  the  role  of  words  in  a 
semantic  theory  leads  to  a view  of  the  lexicon  as  a simple  listing  of  symbolic 
fragments  devoid  of  any  systematic  structure. 

The topics presented in this chapter serve to illustrate how much is
missed by this narrow view. As we will see, the lexicon has a highly
systematic structure that governs what words can mean, and how they can be used.
This structure consists of relations among words and their meanings, as well
as the internal structure of individual words. The study of this systematic,
meaning-related structure is called Lexical Semantics.

Before  moving  on,  we  will  first  introduce  a few  new  terms,  since  the 
ones we have been using thus far are entirely too vague. In particular, the
word  word  has  by  now  been  used  in  so  many  different  ways  that  it  will 
prove  difficult  to  make  unambiguous  use  of  it  in  this  chapter.  Instead,  we 
will  focus  on  the  notion  of  a lexeme,  an  individual  entry  in  the  lexicon. 
A lexeme  should  be  thought  of  as  a pairing  of  a particular  orthographic  and 
phonological  form  with  some  form  of  symbolic  meaning  representation.  The 
lexicon  is  therefore  a finite  list  made  up  of  lexemes.  When  appropriate,  we 
will use the terms orthographic form and phonological form to refer to the
appropriate form part of this pairing, and the term sense to refer to a lexeme’s
meaning  component.  Note  that  these  definitions  will  undergo  a number  of 
refinements  as  needed  in  later  sections. 
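
To make this nomenclature concrete, a lexeme can be pictured as a small record
and the lexicon as a finite list of such records. The sketch below is only one
possible rendering: the field names, the example entry, its phonetic
transcription, and the helper senses_for are all assumptions introduced for
illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexeme:
    orthographic_form: str
    phonological_form: str  # transcription scheme is an assumption
    sense: str              # stand-in for a symbolic meaning representation

# The lexicon is a finite list made up of lexemes.
lexicon = [
    Lexeme("flight", "F L AY T", "Flight"),
    Lexeme("serve", "S ER V", "Serving"),
]

def senses_for(form, entries):
    """Collect the sense of every lexeme with the given orthographic form."""
    return [lx.sense for lx in entries if lx.orthographic_form == form]
```

Returning a list from senses_for, rather than a single sense, leaves room for
the later refinements in which one form is associated with several senses.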

Given  this  minimal  nomenclature,  let  us  return  to  the  topic  of  what 
facts we can discover about lexemes that are relevant to the topic of meaning.
A fruitful place to start such an exploration is a dictionary. Dictionaries are,
after all, nothing if not repositories of information about the meanings of
lexemes. Within dictionaries, it turns out that the most interesting place to
look first is at the definitions of lexemes that no one ever actually looks up.
For example, consider the following fragments from the definitions of right,
left, red, blood from the American Heritage Dictionary (Morris, 1985).

right adj located nearer the right hand esp. being on the right when facing
the same direction as the observer.

left adj located nearer to this side of the body than the right.

red n the color of blood or a ruby.

blood n the red liquid that circulates in the heart, arteries and veins of animals.

The  first  thing  to  note  about  these  definitions  is  the  surprising  amount 
of  circularity  in  them.  The  definition  of  right  makes  two  direct  references  to 
itself,  while  the  entry  for  left  contains  an  implicit  self-reference  in  the  phrase 
this  side  of  the  body,  which  presumably  means  the  left  side.  The  entries  for 
red  and  blood  avoid  this  kind  of  direct  self-reference  by  instead  referencing 
each other in their definitions. Such circularity is, of course, inherent in all
dictionary definitions; these examples are just extreme cases. In the end, all
definitions are stated in terms of lexemes that are, in turn, defined in terms
of other lexemes.
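
The circularity just described can be checked mechanically by treating each
definition as a set of links to the words it mentions. In the sketch below, the
word lists are hand-picked from the four fragments above, and the function name
reaches_itself is hypothetical; it simply follows links until it either returns
to the starting word or runs out of defined words.

```python
# Content words mentioned in each definition fragment (hand-picked).
definitions = {
    "right": ["right", "hand", "observer"],
    "left": ["side", "body", "right"],
    "red": ["color", "blood", "ruby"],
    "blood": ["red", "liquid", "heart", "arteries", "veins", "animals"],
}

def reaches_itself(word, defs):
    """True if following definition links from `word` leads back to `word`."""
    seen = set()
    stack = list(defs.get(word, []))
    while stack:
        current = stack.pop()
        if current == word:
            return True
        if current in seen:
            continue
        seen.add(current)
        # Words without an entry here contribute no further links.
        stack.extend(defs.get(current, []))
    return False
```

On this small table, right is found to be directly circular, and red and blood
are circular through each other. The entry for left is not flagged, because its
self-reference is only implicit in the phrase this side of the body and never
appears as an explicit link.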

From  a purely  formal  point  of  view,  this  inherent  circularity  is  evidence 
that what dictionary entries provide are not, in fact, definitions at all. They
are simply descriptions of lexemes in terms of other lexemes, with the hope
being  that  the  user  of  the  dictionary  has  sufficient  grasp  of  these  other  terms 
to  make  the  entry  in  question  sensible.  As  is  obvious  with  lexemes  like  red 
and  right,  this  approach  will  fail  without  some  ultimate  grounding  in  the 
external  world. 

Fortunately,  even  with  this  limitation,  there  is  still  a wealth  of  semantic 
information  contained  in  these  kinds  of  definitions.  For  example,  the  above 
definitions make it clear that right and left are similar kinds of lexemes that
stand in some kind of alternation, or opposition, to one another. Similarly,
we can glean that red is a color, that it can be applied to both blood and rubies,
and that blood is a liquid. As we will see in this chapter, given a sufficiently
large database of facts such as these, many applications are quite capable
of performing sophisticated semantic tasks (even if they do not really know
their right from their left).

To  summarize,  we  can  capture  quite  a bit  about  the  semantics  of  in- 
dividual lexemes  by  analyzing  and  labeling  their  relations  to  other  lexemes 
in various settings. We will, in particular, be interested in accounting for
the similarities and differences among different lexemes in similar settings,
and  the  nature  of  the  relations  among  lexemes  in  a single  setting.  This  lat- 
ter topic  will  lead  us  to  examine  the  idea  that  lexemes  are  not  unanalyzable 
atomic  symbols,  but  rather  have  an  internal  structure  that  governs  their  com- 
binatoric possibilities.  Later,  in  Section  16.4,  we  will  take  a closer  look  at 
the  notion  of  creativity,  or  generativity,  and  the  lexicon.  There  we  will  ex- 
plore the  notion  that  the  lexicon  should  not  be  thought  of  as  a finite  listing, 
but  rather  as  a creative  generator  of  infinite  meanings. 

Before  proceeding,  we  should  note  that  the  view  of  lexical  seman- 
tics presented  here  is  not  oriented  solely  towards  improving  computational 
applications  of  the  more  restrictive  “only  sentences  have  meaning”  variety. 
Rather, as we will see, it lends itself to a wide array of applications that
involve the use of words, and that can be improved by some knowledge
of their meanings.

16.1 Relations Among Lexemes and Their Senses

This section explores a variety of relations that hold among lexemes and
among  their  senses.  The  list  of  relations  presented  here  is  by  no  means 
exhaustive;  the  emphasis  is  on  those  relations  that  have  had  significant  com- 
putational implications.  As  we  will  see,  the  primary  analytic  tool  we  will 
use  involves  the  systematic  substitution  of  one  lexeme  for  another  in  some 
setting.  The  results  of  such  substitutions  can  reveal  the  presence  or  absence 
of  a specific  relationship  between  the  substituted  lexemes. 

Homonymy 

We begin this section with a discussion of homonymy, perhaps the
simplest, and semantically least interesting, relation to hold between lexemes.
Traditionally, homonymy is defined as a relation that holds between words
that have the same form with unrelated meanings. The items taking part in
such a relation are called homonyms. A classic example of homonymy is
bank with its distinct financial institution and sloping mound meanings, as
illustrated in the following WSJ examples.

(16.1) Instead, a bank can hold the investments in a custodial account in the
client’s name.

(16.2) But as agriculture burgeons on the east bank, the river will shrink
even more.

Loosely  following  lexicographic  tradition,  we  will  denote  this  relationship 
by placing a superscript on the orthographic form of the word as in bank¹
and bank². This notation indicates that these are two separate lexemes, with
distinct  and  unrelated  meanings,  that  happen  to  share  an  orthographic  form. 

It  will  come  as  little  surprise  that  any  definition  this  simple  will  prove 
to  be  problematic  and  will  need  to  be  refined.  In  the  following  discussion, 
we  will  explore  this  definition  by  examining  pairs  of  words  that  satisfy  it, 
but  which  for  a number  of  reasons  seem  to  be  marginal  examples.  We  will 
begin  by  focusing  solely  on  issues  of  form,  returning  later  to  the  topic  of 
meaning. Note that while this may seem like an odd choice given the topic of
this  chapter,  these  discussions  will  serve  to  introduce  a number  of  important 
distinctions  needed  in  later  sections.  In  this  discussion,  we  will  be  primarily 
concerned  with  how  well  our  definition  of  homonymy  assists  us  in  identify- 
ing and  characterizing  those  lexemes  which  will  lead  to  ambiguity  problems 
for  various  applications. 

Returning to the bank example, the first thing to note is that bank¹ and
bank² are identical in both their orthographic and phonological forms. Of
course, there are also pairs of lexemes with distinct meanings which do not
share both forms. For example, pairs like wood and would, and be and bee,
are pronounced the same but are spelled differently. Indeed, as we saw in
Chapter 5, when pronunciation in context is taken into account, the situation
is even worse. Recall that the lexemes knee, need, neat, new, you, the, and
to can all be pronounced as [ni], given the right context. Clearly, if the notion
of form in our definition of homonymy includes a word’s phonological form
in context, there will be a huge number of homonyms in English.

Of course, none of these examples would traditionally be considered good
candidates for homonymy. The notion of homonymy is most closely
associated with the field of lexicography, where normally only dictionary
entries with identical citation forms are considered candidates for homonymy.
Citation forms are the orthographic forms that are used to alphabetically
index words in a dictionary, which in English correspond to what we have been
calling the root form of a word. Under this view, words with the same
pronunciation but different spellings are not considered homonyms, but rather
homophones, distinct lexemes with a shared pronunciation.

Of course, there are also pairs of lexemes with identical orthographic
forms but different pronunciations. Consider, for example, the distinct fish
and music meanings associated with the orthographic form bass in the
following examples.

(16.3)  The  expert  angler  from  Dora,  Mo.,  was  fly-casting  for  bass  rather 
than  the  traditional  trout. 

(16.4)  The  curtain  rises  to  the  sound  of  angry  dogs  baying  and  ominous 
bass  chords  sounding. 

While these examples more closely fit the traditional definition of homonymy,
they would only rarely appear in any traditional list of homonyms. Instead,
lexemes with the same orthographic form with unrelated meanings are called
homographs.

Finally,  we  should  note  that  lexemes  with  different  parts  of  speech  are 
also  typically  not  considered  to  be  good  candidates  for  homonymy.  This 
restriction  serves  to  rule  out  examples  such  as  would  and  wood,  on  grounds 
other than their orthography. The basis for this restriction is two-fold: first,
as we saw when we discussed part-of-speech tagging, lexemes with such
different parts of speech are easily distinguished based on their differing
syntactic environments, and second, lexical items can take on many distinct
forms based on their inflectional and derivational morphology, which is in
turn largely based on part-of-speech.

To  complicate  matters,  the  issue  of  differing  morphology  can  also  oc- 
cur with  lexemes  that  have  the  same  part-of-speech.  Consider  the  lexemes 
find  and  found  in  their  locating  and  creating  an  institution  meanings,  as  il- 
lustrated in  the  following  WSJ  examples. 

(16.5) He has looked at 14 baseball and football stadiums and found that
only one - private Dodger Stadium - brought more money into a city
than it took out.

(16.6)  Culturally  speaking,  this  city  has  increasingly  displayed  its 
determination  to  found  the  sort  of  institutions  that  attract  the  esteem 
of  Eastern  urbanites. 

Here  we  have  two  lexemes  with  distinct  root  forms,  find  and  found,  that 
nevertheless share the morphological variant found as the past tense of the
first,  and  the  root  of  the  second. 

At this point, having raised all of these complexities, we might
create a more refined definition for homonymy as two lexemes with unrelated
meanings, the same part of speech, and identical orthographic and
phonological forms in all possible morphological derivations. Under this definition,
all  homonyms  would  also  be  both  homographs  and  homophones,  with  the 
converse  not  necessarily  being  the  case.  Under  this  new  definition,  most  of 
the  homographs  and  homophones  presented  earlier  would  be  ruled  out  as 
homonyms. 
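
The refined definition can be written down as a simple test over lexical
entries. The sketch below is an illustration under several assumptions: each
lexeme lists only a few of its inflected forms, the pronunciations are informal
transcriptions, the class and function names are hypothetical, and the
unrelatedness of the two meanings is taken as given rather than computed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lexeme:
    sense: str
    pos: str
    forms: frozenset  # (spelling, pronunciation) pairs over inflected forms

def is_homonym_pair(a, b):
    """Same part of speech and identical forms in every listed inflection."""
    return a.pos == b.pos and a.forms == b.forms

def shares_spelling(a, b):
    return bool({s for s, _ in a.forms} & {s for s, _ in b.forms})

def shares_pronunciation(a, b):
    return bool({p for _, p in a.forms} & {p for _, p in b.forms})

bank_forms = frozenset({("bank", "B AE NG K"), ("banks", "B AE NG K S")})
bank1 = Lexeme("financial institution", "noun", bank_forms)
bank2 = Lexeme("sloping mound", "noun", bank_forms)

find = Lexeme("locate", "verb",
              frozenset({("find", "F AY N D"), ("found", "F AW N D")}))
found = Lexeme("create an institution", "verb",
               frozenset({("found", "F AW N D"), ("founded", "F AW N D IH D")}))

bass_fish = Lexeme("fish", "noun", frozenset({("bass", "B AE S")}))
bass_music = Lexeme("low tones", "noun", frozenset({("bass", "B EY S")}))
```

With these entries, only the two bank lexemes pass is_homonym_pair. The pair
find and found shares a spelling and a pronunciation without passing the test,
and the two bass lexemes share a spelling but no pronunciation, which matches
the way the earlier examples are ruled out by the refined definition.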

Such  definitional  exercises,  however,  merely  obscure  our  reason  for 
raising  the  issue  of  homonymy  in  the  first  place;  homonymy  is  of  interest 
computationally  to  the  extent  that  it  leads  an  application  into  dealing  with 
ambiguity. Whether or not a given pair of lexemes causes ambiguity to arise
in  an  application  is  entirely  dependent  on  the  nature  of  the  application.  As  we 
will  see  in  the  following  discussion  of  various  applications,  distinguishing 
perfect  examples  of  homonymy  from  imperfect  examples  is  of  very  little 
practical  value.  The  critical  issue  is  whether  the  nature  of  the  form  overlap 
is  likely  to  cause  difficulties  for  a given  application. 

In spelling correction, homophones can lead to real-word spelling
errors, or malapropisms, as when lexemes such as weather and whether are
interchanged. Note that this is a case where a phonological overlap causes a
problem for a purely text-based system. Additional problems in spelling
correction are caused by such imperfect homographs as find and found, which
have partially overlapping morphologies. In this case, a word-form like
founded may represent a correct use of the past tense, or an incorrect
over-application of the regular past tense rule to an irregular verb.

In speech recognition, homophones such as to, two and too cause
obvious problems. What is less clear, however, is that perfect homonyms such
as bank are also problematic. Recall that speech recognition systems rely
on language models that are often based on tables of N-gram probabilities.
For perfect homonyms, the entries for all the distinct lexemes are conflated
despite the fact that the different lexemes occur in different environments.
This conflation assigns inappropriately high probabilities to words that are
cohorts of the lexeme not in use, and lower than appropriate probabilities to
the correct cohorts.
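
The effect of this conflation can be seen in a toy calculation. The word lists
below are invented for illustration, the function name bigram_probs is
hypothetical, and the probabilities are plain relative frequencies with no
smoothing.

```python
from collections import Counter

# Invented data: the word following each observed use of "bank".
following_bank1 = ["account", "loan", "account", "teller"]  # financial sense
following_bank2 = ["erosion", "vegetation"]                 # sloping mound

def bigram_probs(next_words):
    """Relative-frequency estimate of P(next word | bank)."""
    counts = Counter(next_words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Estimates if the financial lexeme could be counted on its own.
separate = bigram_probs(following_bank1)

# Estimates when both lexemes share one table entry, as in an N-gram
# table indexed by orthographic form.
conflated = bigram_probs(following_bank1 + following_bank2)
```

In this toy table the estimate for account drops from 2/4 to 2/6 once the two
lexemes are pooled, while erosion, which never follows the financial sense,
receives a nonzero estimate of 1/6.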

Finally, text-to-speech systems are vulnerable to homographs with
distinct pronunciations. This problem can be avoided to some extent with
examples such as conduct, whose different pronunciations are associated with
distinct parts of speech, through the use of part-of-speech tagging.
However, for other examples like bass the two lexemes must be distinguished
by some other means. Note that this situation is the reverse of the one we
had with spelling correction; here a fundamentally speech-oriented system
is being plagued by an orthographic problem.
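
The contrast between conduct and bass can be sketched as a lookup keyed on
spelling and part of speech. The table, the transcriptions with stress digits,
and the function name pronounce are assumptions made for this example; a real
system would draw them from a pronunciation lexicon and a tagger.

```python
# Hypothetical pronunciation table keyed by (spelling, part of speech).
CANDIDATES = {
    ("conduct", "noun"): ["K AA1 N D AH0 K T"],
    ("conduct", "verb"): ["K AH0 N D AH1 K T"],
    ("bass", "noun"): ["B AE1 S", "B EY1 S"],  # fish vs. music
}

def pronounce(word, pos):
    """Return the pronunciation if the part of speech settles it, else None."""
    options = CANDIDATES.get((word, pos), [])
    return options[0] if len(options) == 1 else None
```

A tag of noun or verb is enough to pick a pronunciation for conduct, but the
lookup for bass returns None because both candidates are nouns, so some other
source of evidence is needed to choose between them.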

Polysemy 

Having  muddied  the  waters  discussing  issues  of  form  and  homonymy,  let 
us  return  to  the  topic  of  what  it  means  for  two  meanings  to  be  related  or 
unrelated.  Recall  that  the  definition  of  homonymy  requires  that  the  lexemes 
in  question  have  distinct  and  unrelated  meanings.  This  is  the  crux  of  the 
matter; if the meanings in question are related in some way then we are
dealing with a single lexeme with more than one meaning, rather than two
separate lexemes. This phenomenon of a single lexeme with multiple related
meanings is known as polysemy. Note that earlier we had defined a lexeme
as  a pairing  between  a surface  form  and  a sense.  Here  we  will  expand  that 
notion  to  be  a pairing  of  a form  with  a set  of  related  senses. 

To make this notion more concrete, consider the following bank
example from the WSJ corpus.

(16.7) While some banks furnish sperm only to married women, others are
much less restrictive.

Although  this  is  clearly  not  a use  of  the  sloping  mound  meaning  of  bank, 
it  just  as  clearly  is  not  a reference  to  a promotional  giveaway  at  a financial 
institution. One way to deal with this use would be to create bank³, yet
another distinct lexeme associated with the form bank, and give it a
meaning appropriate to this use. Unfortunately, according to our definition of
homonymy, this would require us to say that the meaning of bank in this
example is distinct from and unrelated to the financial institution sense, which seems
to  be  far  too  strong  a statement.  The  notion  of  polysemy  allows  us  to  state 
that  this  sense  of  bank  is  related  to,  and  possibly  derived  from,  the  financial 
institution  sense,  without  asserting  that  it  is  a distinct  lexeme. 

As one might suspect, the task of distinguishing homonymy from polysemy
is not quite as straightforward as we made it seem with these bank
examples. There are two criteria that are typically invoked to determine
whether the meanings of two lexemes are related: the history,
or etymology, of the lexemes in question, and how the words are conceived
of  by  native  speakers.  In  practice,  an  ill-defined  combination  of  evidence 
from  these  two  sources  is  used  to  distinguish  homonymous  from  polysemous 
lexical  entries.  In  the  case  of  bank,  the  etymology  reveals  that  bank1  has  an 
Italian  origin,  while  bank2  is  of  Scandinavian  origin,  thus  encouraging  us  to 
list  them  as  distinct  lexemes.  On  the  other  hand,  our  belief  that  the  use  of 
bank  in  Example  16.7  is  related  to  bank1  is  based  on  introspection  about  the 
similarities  of  their  meanings,  and  the  lack  of  any  etymological  evidence  for 
an  independent  third  sense. 

In  the  absence  of  detailed  etymological  evidence,  a useful  intuition  to 
use  in  distinguishing  homonymy  from  polysemy  is  the  notion  of  coincidence. 
Cases of homonymy can usually be understood easily as accidents of history:
two lexemes which have coincidentally come to share the same form. On
the other hand, it is far more difficult to accept cases of polysemy as coincidences.
Returning again to our bank example, it is difficult to accept the idea
that the various uses of bank in all of its various repository senses are only
coincidentally related to the savings institution sense.

Once we have determined that we are dealing with a polysemous lexeme,
we are of course still left with the task of managing the potentially
numerous  polysemous  senses  associated  with  it.  In  particular,  for  any  given 
single  lexeme  we  would  like  to  be  able  to  answer  the  following  questions. 

• What distinct senses are there?

• How are these senses related?

• How  can  they  be  reliably  distinguished? 

The answers to these questions can have serious consequences for how well
semantic analyzers, search engines, generators, and machine translation systems
perform their respective tasks. The first two questions will be covered
here and in Section 16.4, while the final question will be covered in depth in
Chapter  17. 

The  issue  of  deciding  how  many  distinct  senses  should  be  associated 
with  a given  polysemous  lexeme  is  a task  that  has  long  vexed  lexicographers, 
who  until  recently  have  been  the  only  people  engaged  in  the  creation  of  large 
lexical  databases.  Most  lexicographers  take  the  approach  of  creating  entries 
with  as  many  senses  as  necessary  to  account  for  all  the  fine  distinctions  in 
meaning observed in some very large corpus of examples. This is a reasonable
approach given that the primary use for a traditional dictionary is to
assist users in learning the various uses of a word. Unfortunately, it tends to
err on the side of making more distinctions than are normally required for
any  reasonable  computational  application. 

To  make  this  notion  of  distinguishing  distinct  senses  more  concrete, 
consider  the  following  uses  of  the  verb  serve  from  the  WSJ  corpus. 

(16.8)  They  rarely  serve  red  meat,  preferring  to  prepare  seafood,  poultry  or 
game  birds. 

(16.9)  He  served  as  U.S.  ambassador  to  Norway  in  1976  and  1977. 

(16.10)  He  might  have  served  his  time,  come  out  and  led  an  upstanding  life. 

Reasonable arguments can be made that each of these examples represents
a distinct sense of serve. For example, the implicit contrast between
serving red meat and preparing seafood in the first example indicates
a strong  connection  between  this  sense  of  serve  and  the  related  notion  of 
food  preparation.  Since  there  is  no  similar  component  in  any  of  the  other 
examples,  we  can  assume  that  this  first  use  is  distinct  from  the  other  two. 

Next, we might note that the second example has a different syntactic subcategorization
from the others since its first argument, which denotes the
role played by the subject, is a prepositional phrase. As will be discussed
in Section 16.3, such differing syntactic behaviors are often symptomatic of
differing underlying senses. Finally, the third example is specific to the domain
of incarceration. This is clear since this example provides almost no
specific  information  about  prison,  and  yet  has  an  obvious  and  clear  meaning; 
a meaning  which  plays  no  role  in  the  other  examples. 

Another practical technique for determining if two distinct senses are
present is to combine two separate uses of a lexeme into a single example
using a conjunction, a device that has the rather improbable name of zeugma.
Consider the following ATIS examples.

(16.11)  Which  of  those  flights  serve  breakfast? 

(16.12) Does Midwest express serve Philadelphia?

(16.13) ?Does Midwest express serve breakfast and Philadelphia?

The oddness of the invented third example indicates there is no sensible way to
make a single sense of serve work for both breakfast and Philadelphia. More
precisely, the underlying concepts invoked by serve in the first example cannot
be applied in any meaningful way to Philadelphia. This is an instance
where we can make use of examples from a corpus along with our native
intuitions in a structured way to discover the presence of distinct senses.

The  issue  of  discovering  the  proper  set  of  senses  for  a given  lexeme  is 
distinct  from  the  process  of  determining  which  sense  of  a lexeme  is  being 
used in a given example. This latter task is called word sense disambiguation,
or word sense tagging by analogy to part-of-speech tagging, and is
covered  in  detail  in  Chapter  17.  As  this  analogy  implies,  the  task  typically 
presumes  that  a fixed  set  of  senses  can  be  associated  with  each  lexical  item, 
a dubious  proposition  that  we  will  take  up  in  Section  16.4. 

Finally,  let  us  turn  briefly  to  the  topic  of  relatedness  among  the  various 
senses  of  a single  polysemous  lexeme.  Earlier,  we  made  an  appeal  to  the 
intuition that the polysemous senses of a lexeme are unlikely to have come
about by coincidence. This raises the obvious question: if they are not
related by coincidence, how are they related? This question has not received
much  attention  from  those  constructing  large  lexicons  since  as  long  as  the 
lexicon  contains  the  correct  senses,  how  they  came  to  be  there  is  largely 
irrelevant.  However,  as  soon  as  applications  begin  to  deal  with  a wide  variety 
of  inputs,  they  encounter  novel  uses  that  do  not  correspond  to  any  of  the 
static  senses  in  the  system's  lexicon.  By  examining  the  systematic  relations 
among  listed  senses,  we  can  gain  insight  into  the  meanings  of  such  novel 
uses.  These  notions  will  be  discussed  in  more  detail  in  Section  16.4. 

Synonymy 

The  phenomenon  of  synonymy  is  sufficiently  widespread  to  account  for  the 
popularity  of  both  thesauri  and  crossword  puzzles.  As  with  homonymy,  the 
notion of synonymy has a deceptively simple definition: different lexemes
with the same meaning. Of course, this definition leaves open the question
of what it means for two lexemes to mean the same thing. Although Section
16.3 will provide some answers to this question, we can make progress
without answering it directly by invoking the notion of substitutability: two
lexemes will be considered synonyms if they can be substituted for one another
in a sentence without changing either the meaning or the acceptability of
the sentence. The following ATIS examples illustrate this notion of substitutability.

(16.14)  How  big  is  that  plane? 

(16.15)  Would  I be  flying  on  a large  or  small  plane? 

Exchanging  big  and  large  in  these  examples  has  no  noticeable  effect 
on  either  the  meaning  or  acceptability  of  these  sentences.  We  can  take  this 
as  evidence  for  the  synonymy  of  big  and  large,  at  least  for  these  examples. 
Note that this is intended to be a very narrow statement. In particular, we are
not  saying  anything  about  the  relative  likelihood  of  occurrence  of  big  and 
large  in  contexts  similar  to  these. 

Not surprisingly, if we take the notion of substitutability to mean substitutable
in all possible environments, then true synonyms in English are few
and far between, as it is almost always possible to find some sentence where
a purported synonym fails to substitute successfully. Given this, we will fall
back on a weaker notion that allows us to call two lexemes synonyms if they
are substitutable in some environment. This is, for all practical purposes, the
notion  of  synonymy  used  in  most  dictionaries  and  thesauri. 

The  success  or  failure  of  the  substitution  of  a given  pair  of  candidate 
synonyms  in  a given  setting  depends  primarily  on  four  influences:  polysemy, 
subtle  shades  of  meaning,  collocational  constraints,  and  register.  As  we  will 
see,  only  the  first  two  involve  the  notion  of  meaning. 

To explore the effect of polysemy on substitutability, consider the following
WSJ example where a substitution of large for big clearly fails.

(16.16)  Miss  Nelson,  for  instance,  became  a kind  of  big  sister  to  Mrs.  Van 
Tassel’s  son,  Benjamin. 

(16.17)  ?Miss  Nelson,  for  instance,  became  a kind  of  large  sister  to  Mrs. 
Van  Tassel’s  son,  Benjamin. 

The  source  of  this  failure  is  the  fact  that  the  lexeme  big  has  as  one  of  its 
distinct  polysemous  senses  the  notion  of  being  older,  or  grown  up.  Since 
the lexeme large lacks this sense among its many meanings, it is not substitutable
for big in those environments where this sense is required. In this
instance, the result is a sentence with a different meaning altogether. In other
cases, such a substitution may result in a sentence that is either odd or entirely
uninterpretable.

We referred to the next influence on synonymy as shades of meaning.
By this, we have in mind cases where two lexemes share a central core
meaning, but where additional ancillary facts are associated with one of the
lexemes. Consider the use of the lexemes price and fare in the ATIS corpus.
Semantically, both have the notion of the cost for a service at the core of
their meanings. They are not, however, freely interchangeable. Consider the
following  ATIS  examples. 

(16.18)  What  is  the  cheapest  first  class  fare? 

(16.19)  ?What  is  the  cheapest  first  class  price? 

Exchanging  price  for  fare  in  this  example  leads  to  a certain  amount  of 
oddity.  The  source  of  this  oddness  is  hard  to  pin  down,  but  fare  seems  to  be 
better suited to the costs for various services (i.e., coach, business and first
class fares), while price seems better applied to the tickets that represent
these services. Of course, a more complete account of how these lexemes
are used in this domain would require a systematic analysis of a corpus of
examples. The point is that although these terms share a core meaning, there
are subtle meaning-related differences that influence how they can be used.

These  two  influences  on  substitutability  clearly  involve  the  meanings 
of the lexical items. There are, however, other influences on the success
or failure of a synonym substitution that are not based on meaning in any
direct way. Collocational constraints are one such influence. By a collocational
constraint, we mean the kind of arbitrary associations, or attractions,
between  lexical  items  that  were  captured  using  techniques  such  as  N-grams 
in  Chapter  6. 

Consider  the  following  WSJ  example. 

(16.20)  We  frustrate  ’em  and  frustrate  ’em,  and  pretty  soon  they  make  a big 
mistake. 

(16.21)  ?We  frustrate  ’em  and  frustrate  ’em,  and  pretty  soon  they  make  a 
large  mistake. 

As  this  example  illustrates,  there  is  a preference  for  using  big  rather  than 
large when referring to mistakes of a critical or important nature. This is
not  due  to  a polysemy  difference,  nor  does  it  seem  to  be  due  to  any  subtle 
shaded  meaning  difference  between  big  and  large.  Note  also,  that  this  is 
clearly  different  than  the  large  sister  example  in  that  a large  mistake  is  still 
interpretable  in  the  correct  way;  it  just  does  not  seem  as  natural  to  use  large 
as  big.  Therefore,  in  this  case,  we  must  say  that  there  is  simply  an  arbitrary 
preference  for  big  as  opposed  to  large  as  applied  to  mistakes. 

Finally, by register, we mean the social factors that surround the use of
possible synonyms. Here we are referring to lexemes with essentially identical
meanings that are not interchangeable in all environments due to factors
such as politeness, group status, and other similar social pressures. For example,
multisyllabic lexemes with Latin or Greek origins are often used in
place of shorter lexemes when a technical or academic style is desired.

As  was  the  case  with  homonymy,  these  influences  on  synonymy  have 
differing  practical  implications  for  computational  applications.  In  Chapters 
19 and 20, we will see that similarity of meaning, collocational constraints,
and appropriateness of use are of great importance in natural language generation
and machine translation. On the other hand, in the domains of information
extraction and information retrieval, appropriateness of use is of far
less  consequence  than  the  notion  of  identity  of  meaning. 
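
The difference between the strict and weak notions of substitutability used in this section can be made concrete with a small sketch. The Python below is only an illustration: the frame strings, the ACCEPTABLE_FRAMES table, and both function names are assumptions introduced here, and the judgments simply transcribe Examples 16.14 through 16.21.

```python
# Frames in which each lexeme is judged acceptable, transcribed from
# Examples 16.14-16.21. The "_" slot and the set-based encoding are
# assumptions made for illustration only.
ACCEPTABLE_FRAMES = {
    "big": {
        "How _ is that plane?",
        "became a kind of _ sister",
        "they make a _ mistake",
    },
    "large": {
        "How _ is that plane?",
    },
}


def weak_synonyms(a, b, frames=ACCEPTABLE_FRAMES):
    """Weak notion: substitutable in at least one shared environment."""
    return bool(frames[a] & frames[b])


def strict_synonyms(a, b, frames=ACCEPTABLE_FRAMES):
    """Strict notion: substitutable in every environment."""
    return frames[a] == frames[b]


if __name__ == "__main__":
    print(weak_synonyms("big", "large"))    # shared plane frame
    print(strict_synonyms("big", "large"))  # sister and mistake frames differ
```

On this toy table big and large count as synonyms under the weak notion but not under the strict one, which mirrors the fallback position adopted in the text.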

Hyponymy 

In our discussion of price and fare, we introduced the notion of pairs of
lexemes  with  similar  but  non-identical  meanings.  The  notion  of  hyponymy 
is  based  on  a restricted  class  of  such  pairings:  pairings  where  one  lexeme 
denotes  a subclass  of  the  other.  For  example,  the  relationship  between  car 
and  vehicle  is  one  of  hyponymy.  Since  this  relation  is  not  symmetric  we  will 
refer  to  the  more  specific  lexeme  as  a hyponym  of  the  more  general  one, 
and  conversely  to  the  more  general  term  as  a hypernym  of  the  more  specific 
one. We would therefore say that car is a hyponym of vehicle, and vehicle is
a hypernym of car.

As  with  synonymy,  we  can  explore  the  notion  of  hyponymy  by  making 
use  of  a restricted  kind  of  substitution.  Consider  the  following  schema. 

That is a x. => That is a y.

If  x is  a hyponym  of  y,  then  in  any  situation  where  the  sentence  on  the  left 
is  true,  the  newly  created  sentence  on  the  right  must  also  be  true,  as  in  the 
following  example. 

That is a car. => That is a vehicle.

There are a number of important differences between this kind of limited
substitution and the kind of substitutions discussed with respect to synonymy.
There the resulting sentence could plausibly serve as a substitute for
the original sentence. Here, the new sentence is not intended to be a substitution
for the original; rather, it merely serves as a diagnostic test for the
presence of hyponymy.

The concept of hyponymy is closely related to a number of other notions
that play central roles in biology, linguistic anthropology and computer
science. 

The term ontology usually refers to an analysis of some domain, or microworld,
into a set of distinct objects. A taxonomy is a particular arrangement
of the elements of an ontology into a tree-like class inclusion structure.
Normally, there are a set of well-formedness constraints on taxonomies that
go beyond their component class inclusion relations. For example, the lexemes
hound, mutt, and puppy are all hyponyms of dog, but it would be odd
to construct a taxonomy from those pairs since the concepts motivating the
relations are different in each case. Finally, the computer science notion of
an object hierarchy is based on the notion that objects from an ontology, arranged
in a taxonomy, can receive, or inherit, features from their ancestors
in a taxonomy. This, of course, only makes sense when the elements in the
taxonomy are in fact complex structured objects with features to be inherited.
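
The notion of inheritance in an object hierarchy can be illustrated with a few lines of Python. This is a minimal sketch using the car and vehicle pair from the text; the two features (carries_passengers and wheels) are made up for the example and are not drawn from any actual ontology.

```python
class Vehicle:
    """More general class; the feature below is invented for illustration."""
    carries_passengers = True


class Car(Vehicle):
    """More specific class: inherits carries_passengers, adds its own feature."""
    wheels = 4


if __name__ == "__main__":
    print(Car.carries_passengers)     # feature received from the ancestor
    print(issubclass(Car, Vehicle))   # the class inclusion relation itself
    print(hasattr(Vehicle, "wheels")) # features do not flow downward to upward
```

The class inclusion relation alone corresponds to hyponymy; it is the structured features attached to each class that make inheritance meaningful.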

Therefore, sets of hyponymy relations, by themselves, do not constitute
an ontology, category structure, taxonomy, or object hierarchy. They
have,  however,  proved  to  be  useful  as  approximations  to  such  structures.  We 
will  return  to  the  topic  of  hyponymy  in  Section  16.2  when  we  discuss  the 
WordNet  database. 


16.2  WordNet: A Database of Lexical Relations

The  widespread  use  of  lexical  relations  in  linguistic,  psycholinguistic,  and 
computational research has led to a number of efforts to create large electronic
databases of such relations. These efforts have, in general, followed
one  of  two  basic  approaches:  mining  information  from  existing  dictionaries 
and  thesauri,  and  handcrafting  a database  from  scratch.  Despite  the  obvious 
advantages  of  reusing  existing  resources,  WordNet,  the  most  well-developed 
and  widely  used  lexical  database  for  English,  was  developed  using  the  latter 
approach (Beckwith et al., 1991).

WordNet  consists  of  three  separate  databases,  one  each  for  nouns  and 
verbs, and a third for adjectives and adverbs; closed class lexical items are
not  included  in  WordNet.  Each  of  the  three  databases  consists  of  a set  of 
lexical  entries  corresponding  to  unique  orthographic  forms,  accompanied  by 
sets  of  senses  associated  with  each  form.  Figure  16.1  gives  some  idea  of  the 
scope  of  the  current,  WordNet  1.6,  release.  The  databases  can  be  accessed 
directly  with  a browser  (locally  or  over  the  Internet),  or  programmatically 
through  the  use  of  a set  of  C library  functions. 

In  their  most  complete  form,  WordNet’s  sense  entries  consist  of  a set 
of  synonyms,  a dictionary-style  definition,  or  gloss,  and  some  example  uses. 
Figure 16.2 shows an abbreviated version of the WordNet entry for the noun
bass. As this entry illustrates, there are several important differences between
WordNet entries and our notion of a lexeme. First, since WordNet
contains no phonological information, it makes no attempt to keep separate
lexemes with distinct pronunciations. For example, in this entry bass4,
bass5, and bass8 all refer to the [b ae s] fish sense, while the others refer
to the [b ey s] musical sense. More generally, WordNet makes no attempt
to distinguish homonymy from polysemy. For example, as far as this entry
is concerned, bass1 bears the same relationship to bass2 as it does to
bass4. This is a conservative strategy that reflects the fact that although
there are fairly reliable diagnostics for discriminating among distinct word
senses, systematically organizing the resulting polysemous senses is a much
more uncertain and subjective activity. Given this, the developers of WordNet
have opted to simply list distinct senses, without attempting to explicitly
organize  them  in  the  hierarchical  manner  seen  in  many  dictionaries. 

Figures 16.3 and 16.4 give a rough idea of how these senses are distributed
throughout the database. The distributions are extremely skewed,
with  a small  number  of  entries  having  a large  number  of  senses,  and  a large 


Category     Unique Forms   Number of Senses
Noun         94474          116317
Verb         10319          22066
Adjective    20170          29881
Adverb       4546           5677

Figure  16.1  Scope  of  the  current  WordNet  1.6  release  in  terms  of  unique 
entries  and  total  number  of  senses  for  the  four  databases. 

The noun “bass” has 8 senses in WordNet.

1. bass - (the lowest part of the musical range)
2. bass, bass part - (the lowest part in polyphonic music)
3. bass, basso - (an adult male singer with the lowest voice)
4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
5. freshwater bass, bass - (any of various North American lean-fleshed freshwater
   fishes especially of the genus Micropterus)
6. bass, bass voice, basso - (the lowest adult male singing voice)
7. bass - (the member with the lowest range of a family of musical instruments)
8. bass - (nontechnical name for any of numerous edible marine and
   freshwater spiny-finned fishes)


Figure 16.2 The WordNet 1.6 entry for the noun bass.

number having a single sense. Distributions like this are ubiquitous when
dealing with the lexicon, and are referred to as Zipf distributions (Zipf,
1949). Note also that the degree of polysemy in the verb database is higher
than in the noun database. This is consistent with the fact that there are far
fewer verbs than nouns in English and their meanings are far more malleable.
Finally, we should note that these polysemy distributions correlate well with
actual word frequency and led the WordNet developers to use degree of polysemy
as a proxy for frequency in the database.
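
The comparison between verb and noun polysemy can be checked roughly against the counts in Figure 16.1. The following Python is a minimal sketch that assumes senses per unique form is an acceptable stand-in for degree of polysemy; the names WORDNET_16_COUNTS and average_polysemy are introduced here for illustration.

```python
# Counts copied from Figure 16.1 (WordNet 1.6):
# category -> (unique forms, number of senses).
WORDNET_16_COUNTS = {
    "noun": (94474, 116317),
    "verb": (10319, 22066),
    "adjective": (20170, 29881),
    "adverb": (4546, 5677),
}


def average_polysemy(counts):
    """Return the number of senses per unique form for each category."""
    return {cat: senses / forms for cat, (forms, senses) in counts.items()}


if __name__ == "__main__":
    for category, ratio in average_polysemy(WORDNET_16_COUNTS).items():
        print(f"{category:10s} {ratio:.2f}")
```

On these counts verbs average a little over two senses per form and nouns a little over one, which agrees with the observation above. An average of this kind says nothing about the skew of the distribution itself.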


[Plot: y-axis ticks from 0 to 30 (axis label not recoverable); x-axis: Polysemy Rank, 0 to 100000]


Figure  16.3  Distribution  of  senses  among  the  nouns  in  WordNet. 


Of  course,  a simple  listing  of  lexical  entries  would  not  be  much  more 
useful  than  an  ordinary  dictionary.  The  power  of  WordNet  lies  in  its  set 
of  domain-independent  lexical  relations.  These  relations  can  hold  among 
WordNet entries, senses, or sets of synonyms. They are, for the most part,
restricted  to  items  with  the  same  part-of-speech,  or  more  pragmatically,  to 
items  within  the  same  database.  Figures  16.5,  16.6,  and  16.7  show  a subset 
of  the  relations  associated  with  each  of  the  three  databases,  along  with  a 
brief  explanation  and  an  example.  Since  a full  discussion  of  the  contents 
of  WordNet  is  beyond  the  scope  of  this  text,  we  will  limit  ourselves  to  a 
discussion  of  two  of  its  most  useful  and  well-developed  features:  its  sets  of 
synonyms,  and  its  hyponymy  relations. 

The  fundamental  basis  for  synonymy  in  WordNet  is  the  same  as  that 
given on page 596. Two WordNet entries are considered synonyms if they

Figure 16.4 Distribution of senses among the verbs in WordNet.


Relation     Definition                              Example
Hypernym     From concepts to superordinates         breakfast -> meal
Hyponym      From concepts to subtypes               meal -> lunch
Has-Member   From groups to their members            faculty -> professor
Member-Of    From members to their groups            copilot -> crew
Has-Stuff    From things to what they’re made of
Stuff-Of     From stuff to what it makes up
Has-Part     From wholes to parts                    table -> leg
Part-Of      From parts to wholes                    course -> meal
Antonym      Opposites                               leader -> follower

Figure 16.5 Noun Relations in WordNet.

Relation     Definition                              Example
Hypernym     From events to superordinate events     fly -> travel
Troponym     From events to their subtypes           walk -> stroll
Entails      From events to the events they entail   snore -> sleep
Antonym      Opposites                               increase <=> decrease

Figure 16.6 Verb Relations in WordNet.

Relation     Definition                              Example
Antonym      Opposite                                heavy <=> light
Adverb       Opposite                                quickly <=> slowly

Figure 16.7 Adjective and Adverb Relations in WordNet.

can  be  successfully  substituted  in  some  context.  The  particular  theory  and 
implementation  of  synonymy  in  WordNet  is  organized  around  the  notion  of 
a synset, a set of synonyms. Consider the following example of a synset.

{chump,  fish,  fool,  gull,  mark,  patsy,  fall  guy, 
sucker,  schlemiel,  shlemiel,  soft  touch,  mug} 

The dictionary-like definition, or gloss, of this synset describes it as a person
who is gullible and easy to take advantage of. Each of the lexical entries
included in the synset can, therefore, be used to express this notion in some
setting. In practice, synsets like this one actually constitute the senses associated
with many WordNet entries. Specifically, it is this exact synset, with
its  associated  definition  and  examples,  that  makes  up  one  of  the  senses  for 
each  of  the  entries  listed  in  the  synset. 
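
The way a single synset serves as a sense for each of its member entries can be sketched as a small data structure. This is a hedged illustration in Python, not WordNet's storage format or its library interface; the Synset class and the helpers senses_of and are_synonyms are assumptions made for the example, and the gloss follows the description given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """A set of synonymous lexical entries plus a gloss (hypothetical layout)."""
    members: frozenset
    gloss: str


# The synset shown in the text.
GULLIBLE_PERSON = Synset(
    members=frozenset({
        "chump", "fish", "fool", "gull", "mark", "patsy", "fall guy",
        "sucker", "schlemiel", "shlemiel", "soft touch", "mug",
    }),
    gloss="a person who is gullible and easy to take advantage of",
)


def senses_of(entry, synsets):
    """Return every synset that acts as a sense of the given lexical entry."""
    return [s for s in synsets if entry in s.members]


def are_synonyms(a, b, synsets):
    """Two entries count as synonyms if they share at least one synset."""
    return any(a in s.members and b in s.members for s in synsets)


if __name__ == "__main__":
    database = [GULLIBLE_PERSON]
    print(senses_of("fish", database)[0].gloss)
    print(are_synonyms("chump", "sucker", database))
```

Because the same Synset object is returned for every member, the sketch reflects the point that one synset makes up one sense of each entry it lists.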

Looking  at  this  from  a more  theoretical  perspective,  each  synset  can 
be  taken  to  represent  a concept  that  has  become  lexicalized  in  the  language. 
Synsets  are  thus  somewhat  analogous  to  the  kinds  of  concepts  we  discussed 
in Chapter 14. Instead of representing concepts using logical terms, WordNet
represents them as lists comprised of the lexical entries that can be used
to express the concept. This perspective motivates the fact that it is synsets,
not lexical entries or individual senses, that participate in most of the semantic
relations shown in Figures 16.5, 16.6, and 16.7.

The  hyponymy  relations  in  WordNet  correspond  directly  to  the  notion 
of  immediate  hyponymy  discussed  on  page  599.  Each  synset  is  related  to 
its  immediately  more  general  and  more  specific  synsets  via  direct  hypernym 
and  hyponym  relations.  To  find  chains  of  more  general  or  more  specific 
synsets,  one  can  simply  follow  a transitive  chain  of  hypernym  and  hyponym 
relations.  To  make  this  concrete,  consider  the  hypernym  chains  for  bass3 
and  bass7  shown  in  Figure  16.8. 

In this depiction of hyponymy, successively more general synsets are
shown on successive indented lines. The first chain starts from the concept
of a human bass singer. Its immediate superordinate is a synset corresponding
to the generic notion of a singer. Following this chain leads eventually
to notions such as entertainer and person. The second chain, which starts
from the musical instrument notion, has a completely different chain leading

Sense 3
bass, basso --
(an adult male singer with the lowest voice)
=> singer, vocalist
   => musician, instrumentalist, player
      => performer, performing artist
         => entertainer
            => person, individual, someone...
               => life form, organism, being...
                  => entity, something
               => causal agent, cause, causal agency
                  => entity, something

Sense 7
bass --
(the member with the lowest range of a family of
musical instruments)
=> musical instrument
   => instrument
      => device
         => instrumentality, instrumentation
            => artifact, artefact
               => object, physical object
                  => entity, something

Figure  16.8  Hyponymy  chains  for  two  separate  senses  of  the  lexeme  bass. 
Note  that  the  chains  are  completely  distinct,  only  converging  at  entity. 


eventually to such concepts as musical instrument, device and physical object.
Both paths do eventually join at the synset entity, which basically serves as a
placeholder at the top of the hierarchy.
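
Following a transitive chain of hypernym links is easy to sketch in code. The Python below is an illustration only: the links are transcribed from Figure 16.8 with each synset abbreviated to its first member, and the dictionary layout and function names are assumptions made here rather than WordNet's own interface.

```python
# Direct hypernym links transcribed from Figure 16.8. Each synset is
# abbreviated to one member; the dictionary encoding is an assumption.
HYPERNYMS = {
    "bass (singer)": ["singer"],
    "singer": ["musician"],
    "musician": ["performer"],
    "performer": ["entertainer"],
    "entertainer": ["person"],
    "person": ["life form", "causal agent"],
    "life form": ["entity"],
    "causal agent": ["entity"],
    "bass (instrument)": ["musical instrument"],
    "musical instrument": ["instrument"],
    "instrument": ["device"],
    "device": ["instrumentality"],
    "instrumentality": ["artifact"],
    "artifact": ["object"],
    "object": ["entity"],
    "entity": [],
}


def hypernym_paths(synset):
    """Return every chain from synset up to the root via direct links."""
    parents = HYPERNYMS[synset]
    if not parents:
        return [[synset]]
    return [[synset] + path for p in parents for path in hypernym_paths(p)]


def ancestors(synset):
    """Return all synsets reachable by one or more hypernym links."""
    return {node for path in hypernym_paths(synset) for node in path[1:]}


def is_hyponym_of(x, y):
    """True if y lies somewhere on a hypernym chain above x."""
    return y in ancestors(x)


if __name__ == "__main__":
    shared = ancestors("bass (singer)") & ancestors("bass (instrument)")
    print(shared)
```

The intersection of the two ancestor sets contains only entity, matching the observation that the chains converge only at the top of the hierarchy.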


16.3  The  Internal  Structure  of  Words 

The  approach  to  meaning  spelled  out  in  the  last  two  chapters  hinged  on  the 
notion  that  there  is  a fundamental  predicate-argument  structure  underlying 
our meaning representations. In composing such representations, we assumed
that certain classes of lexemes tend to contribute the predicate and
predicate-argument structure, while others contribute the arguments. This
section explores in more detail the systematic ways that the meanings of lexemes
are structured to support this notion. In particular, it explores the notion
that the meaning representations associated with lexemes have analyzable internal
structures, and that it is these structures, combined with a grammar,
that  determine  the  relations  among  lexemes  in  well-formed  sentences. 

Thematic  Roles 

Thematic roles, first proposed by Gruber (1965a) and Fillmore (1968),¹ are
a set of categories which provide a shallow semantic language for characterizing
certain arguments of verbs. For example, consider the following two
WSJ  fragments: 

(16.22)  Houston’s  Billy  Hatcher  broke  a bat. 

(16.23)  He  opened  a drawer. 

In the predicate calculus event representation of Chapter 14, part of the
representation of these two sentences would be the following:

\[
\exists e, x, y \;\; Isa(e, Breaking) \wedge Breaker(e, BillyHatcher) \wedge BrokenThing(e, y) \wedge Isa(y, BaseballBat)
\]

\[
\exists e, x, y \;\; Isa(e, Opening) \wedge Opener(e, he) \wedge OpenedThing(e, y) \wedge Isa(y, Drawer)
\]

In this representation, the roles of the subjects of the verbs break and
open are Breaker and Opener respectively. These deep roles are specific
to each possible kind of event; Breaking events have Breakers, Opening
events have Openers, Eating events have Eaters, and so on. But Breakers
and Openers have something in common. They are both volitional actors, often
animate, and they have direct causal responsibility for their events. A
thematic role is a way of expressing this commonality. We say that the
subjects of both these verbs are AGENTS. Thus AGENT is the thematic role
which represents an abstract idea such as volitional causation. Similarly, the
direct objects of both these verbs, the BrokenThing and OpenedThing, are
both prototypically inanimate objects which are affected in some way by the
action. The thematic role for these participants is the THEME.

As we will discuss below, while there is no standard set of thematic
roles, there are many roles that are commonly used by computational
systems. For example, in any straightforward interpretation of Example
16.24, Mr. Cockwell has had his collarbone broken, but there is no
implication that he was the AGENT of this unfortunate event. This kind of
participant can be labeled an EXPERIENCER, while the directly affected
participant, the collarbone in this case, is again assigned the THEME role.

¹ Fillmore actually called them deep cases, on the metaphor of morphological case.

(16.24)  A company  soccer  game  last  year  got  so  rough  that  Mr.  Cockwell 
broke  his  collarbone  and  an  associate  broke  an  ankle. 

In Example 16.25, the earthquake is the direct cause of the glass breaking
and hence might seem to be a candidate for an AGENT role. This seems
odd, however, since earthquakes are not the kind of participant that can
intentionally do anything. Examples such as this have been the source of
considerable debate over the years among the proponents of various thematic
role theories. Two approaches are common: assign the earthquake to the
AGENT role and assume that the intended meaning has some kind of
metaphorical connection to the core animate/volitional meaning of AGENT, or
add a role called FORCE that is similar to AGENT but lacks any notion of
volitionality. We will follow this latter approach and return to the notion
of metaphor in Section 16.4.

(16.25)  The  quake  broke  glass  in  several  downtown  skyscrapers. 

Finally, in Example 16.26, the subject (it) refers to an event participant
(in this case, someone else’s elbow) whose role in the breaking event is as
the instrument of some other agent or force. Such participants are called
INSTRUMENTS.

(16.26)  It  broke  his  jaw. 

Figure 16.9 presents a small list of commonly-used thematic roles
along with a rough description of the meaning of each. Figure 16.10
provides representative examples of each role. Note that this list of roles
is by no means definitive, and does not correspond to any single theory of
thematic roles.

Applications  to  Linking  Theory  and  Shallow  Semantic  Interpretations 

One common use of thematic roles in computational systems is as a shallow
semantic language. For example, as Chapter 21 will describe, thematic roles
are sometimes used in machine translation systems as part of a useful
intermediate language.

Thematic Role   Definition
AGENT           The volitional causer of an event
EXPERIENCER     The experiencer of an event
FORCE           The non-volitional causer of the event
THEME           The participant most directly affected by an event
RESULT          The end product of an event
INSTRUMENT      An instrument used in an event
BENEFICIARY     The beneficiary of an event
SOURCE          The origin of the object of a transfer event
GOAL            The destination of an object of a transfer event

Figure 16.9  Some commonly-used thematic roles with their definitions.

Thematic Role   Example
AGENT           The waiter spilled the soup
EXPERIENCER     John has a headache
FORCE           The wind blows debris from the mall into our yards
THEME           Only after Benjamin Franklin broke the ice...
RESULT          The French government has built a regulation-size
                baseball diamond...
INSTRUMENT      He turned to poaching catfish, stunning them with a
                shocking device
BENEFICIARY     Whenever Ann Callahan makes hotel reservations for her
                boss...
SOURCE          I flew in from Boston.
GOAL            I drove to Portland.

Figure 16.10  Prototypical examples of various thematic roles.

Another use of thematic roles, which was part of their original
motivation in Fillmore (1968), was as an intermediary between semantic
roles in conceptual structure or common-sense knowledge like Breaker and
DrivenThing and their more language-specific surface grammatical
realization as subject and object. Fillmore noted that there are
prototypical patterns governing which argument of a verb will become the
subject of an active sentence, proposing the following hierarchy (often now
called a thematic hierarchy (Jackendoff, 1972)) for assigning the subject
role:

AGENT > INSTRUMENT > THEME

Thus if the thematic description of a verb includes an AGENT, an
INSTRUMENT, and a THEME, it is the AGENT which will be realized as the
subject. If the thematic description only includes an INSTRUMENT and a
THEME, it is the INSTRUMENT which will become the subject. The thematic
hierarchy is used in reverse for determining the direct object of active
sentences, or the subject of passive sentences. Here are examples from
Fillmore (1968) using the verb open:

(16.27)  John   opened  the door.
         Agent          Theme

(16.28)  John   opened  the door  with the key.
         Agent          Theme     Instrument

(16.29)  The key     opened  the door.
         Instrument          Theme

(16.30)  The door  was opened  by John.
         Theme                 Agent
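
The subject-selection rule illustrated by these examples can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: a verb's arguments are assumed to arrive as a mapping from role name to filler, the names HIERARCHY, choose_subject, and choose_object are hypothetical, and reading "used in reverse" as "pick the lowest-ranked remaining role" is a simplification rather than Fillmore's own formulation.

```python
# Thematic hierarchy from the text: AGENT > INSTRUMENT > THEME.
HIERARCHY = ["AGENT", "INSTRUMENT", "THEME"]


def choose_subject(roles):
    """Return the highest-ranked role present (subject of an active sentence)."""
    for role in HIERARCHY:
        if role in roles:
            return role
    raise ValueError("no role from the hierarchy is present")


def choose_object(roles):
    """Return the lowest-ranked role present other than the subject, if any."""
    subject = choose_subject(roles)
    for role in reversed(HIERARCHY):
        if role in roles and role != subject:
            return role
    return None


# Hypothetical argument sets for the verb "open".
full = {"AGENT": "John", "INSTRUMENT": "the key", "THEME": "the door"}
no_agent = {"INSTRUMENT": "the key", "THEME": "the door"}

print(choose_subject(full), choose_object(full))          # AGENT THEME
print(choose_subject(no_agent), choose_object(no_agent))  # INSTRUMENT THEME
```

With all three roles present the AGENT surfaces as subject; without an AGENT the INSTRUMENT is promoted, as in The key opened the door.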

This approach led to a wide variety of work over the last thirty years
on the mapping between conceptual structure and grammatical function, in
an area generally referred to as linking theory. For example, many scholars
such as Talmy (1985), Jackendoff (1983b), and Levin (1993) show that
semantic properties of verbs help predict which surface alternations they
can take. An alternation is a set of different mappings of conceptual
(deep) roles to grammatical function. For example, Fillmore (1965) and very
many subsequent researchers have studied the dative alternation, the
phenomenon that certain verbs like give, send, or read, which can take an
Agent, a Theme, and a Goal, allow the Theme to appear as object and the
Goal in a prepositional phrase (as in 16.31a), or the Goal to appear as the
object and the Theme as a sort of ‘second object’ (as in 16.31b):

(16.31)  a.  Doris  gave/sent/read  the book  to Cary.
             Agent                  Theme     Goal

         b.  Doris  gave/sent/read  Cary  the book.
             Agent                  Goal  Theme

Many scholars, including Green (1974), Pinker (1989), Gropen et al.
(1989), Goldberg (1995) and Levin (1993) (see Levin (1993, p. 45) for a full
bibliography), have argued this alternation occurs with particular semantic
classes of verbs, including (from Levin) ‘verbs of future having’ (advance,
allocate, offer, owe), ‘send verbs’ (forward, hand, mail), ‘verbs of
throwing’ (kick, pass, throw), and many other classes.

Similarly, Talmy (1985), following Lakoff (1965, p. 126), shows that
‘affect’ verbs such as frighten, please, and exasperate can appear with the
Theme as subject, as in (16.32), or with the Experiencer as subject and
the Theme as a prepositional object, as in (16.33):

(16.32)  a.  That   frightens  me.
             Theme             Experiencer

         b.  That   interests  me.
             Theme             Experiencer

         c.  That   surprises  me.
             Theme             Experiencer

(16.33)  a.  I            am frightened  of that.
             Experiencer                 Theme

         b.  I            am interested  in that.
             Experiencer                 Theme

         c.  I            am surprised   at that.
             Experiencer                 Theme

Levin (1993) summarizes 80 of these alternations, including extensive
lists of the verbs in each semantic class, together with the semantic
constraints, exceptions, and other idiosyncrasies. This list has been used
in a number of computational models (e.g., Dang et al., 1998; Jing and
McKeown, 1998).

While research of the type summarized above has shown a relation
between verbal semantics and syntactic realization, it is less clear that
this relation is mediated by a small set of thematic roles, with or without
a thematic hierarchy. For one thing, it turns out that semantic classes are
insufficient to define the set of verbs that participate in an alternation:
many verbs do not allow the dative alternation despite being in the proper
semantic class (e.g. donate, return, transfer). In addition, as shown
above, many of the verbal alternations violate any standard thematic
hierarchy (dative alternation sentences like Ling sent Mary the book have a
Goal as direct object followed by an oblique Theme, when Theme should be
the best direct object). Furthermore, arguments about the appropriate set
of thematic roles are legion. But an even greater problem is that thematic
roles, however they are defined, could only play a very small role in the
general mapping from semantics to syntax. This is because thematic roles
are only relevant to determining the grammatical role of NP and PP
arguments, and play no part in the realization of other arguments of verbs
and other predicates. Many such possible arguments were described in
Figure 11.3 on page 411, such as sentential complements (Sfin, Swh-,
Sforto), verb phrases (VPbrst, VPto, etc.), or quotations (Quo).
Furthermore, thematic roles are only useful in mapping the arguments of
verbs; but nouns, for example, have arguments as well (destruction of the
city, father of the bride).

There are a number of possible responses to these problems with
thematic roles. Many systems continue to use them for such practical
purposes as interlinguas in machine translation or as a convenient level of
shallow semantic interpretation. Other researchers have argued that
thematic roles should be considered an epiphenomenon, rather than a
distinct representational level. For example, following Foley and van Valin
(1984), Dowty (1991) argues that rather than a discrete set of thematic
roles there are only two cluster-concepts, PROTO-AGENT and PROTO-PATIENT.
Whether an argument of a verb is a PROTO-AGENT is predictable from the
entailments of the deep conceptual structure meaning of the verb. The
mapping from semantic role in conceptual structure to grammatical function
proceeds via simple rules (the most PROTO-AGENT-like of the arguments is
the subject, the most PROTO-PATIENT-like is the object (or the subject of
the passive construction)). Dowty’s two rules make direct reference to the
deep conceptual structure of the verb; thus thematic roles do not appear at
any representational level at all.
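
One way to picture this kind of proto-role linking is to count, for each argument, how many agent-like and patient-like entailments the verb imposes on it. The following is a rough sketch, not Dowty's own formalization: the property labels only loosely follow his proto-role entailments, the entailment sets given for break are assumptions made for illustration, and simple counting with arbitrary tie-breaking is a deliberate simplification.

```python
# Hypothetical labels for proto-role entailments (loosely after Dowty).
PROTO_AGENT = {"volition", "sentience", "causes_change", "moves"}
PROTO_PATIENT = {"changes_state", "incremental_theme", "causally_affected",
                 "stationary"}


def link(arguments):
    """Map arguments to grammatical functions by counting entailments.

    arguments: dict from argument name to the set of properties the verb
    entails for it. Returns (subject, object); ties are broken arbitrarily.
    """
    def agentivity(name):
        return len(arguments[name] & PROTO_AGENT)

    def patienthood(name):
        return len(arguments[name] & PROTO_PATIENT)

    # The most proto-agent-like argument becomes the subject.
    subject = max(arguments, key=agentivity)
    # Among the rest, the most proto-patient-like becomes the object.
    rest = [name for name in arguments if name != subject]
    obj = max(rest, key=patienthood) if rest else None
    return subject, obj


# Assumed entailments for "Billy Hatcher broke a bat".
breaking = {
    "breaker": {"volition", "sentience", "causes_change"},
    "broken_thing": {"changes_state", "causally_affected"},
}
print(link(breaking))  # ('breaker', 'broken_thing')
```

Note that no thematic role label appears anywhere in the sketch: the linking decision is made directly from the entailment sets.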

One problem with Dowty’s model is that the choice of thematic roles
is not always predictable from the underlying conceptual structure of the
event and its participants. Fillmore (1977) pointed out that the different
verbs which can describe a commercial event each choose a different way to
map the participants of the event. For example, a transaction between Amie
and Benson involving three dollars and a sandwich can be described in any
of these ways:

(16.34)  a.  Amie  bought  the  sandwich  from  Benson  for  three  dollars. 

b.  Benson  sold  Amie  the  sandwich  for  three  dollars. 

c.  Amie  paid  Benson  three  dollars  for  the  sandwich. 

Each of these verbs, buy, sell, and pay, chooses a different perspective
on the commercial event, and realizes this perspective by choosing a
different mapping of underlying participants to thematic roles. The fact
that these three verbs have very different mappings suggests that the
thematic roles for a verb must be listed in the lexical entry for the verb,
and are not predictable from the underlying conceptual structure.

This  fact,  together  with  the  fact  mentioned  earlier  that  verb  alternations 
are  not  completely  predictable  semantically  (e.g.  exceptions  like  donate)  has 
led  many  researchers  to  assume  that  any  useful  computational  lexicon  needs 
to  list  for  each  verb  (or  adjective  or  other  predicate)  its  syntactic  and  the- 
matic combinatory  possibilities.  Another  advantage  of  listing  the  combina- 
tory possibilities  for  each  verb  is  that  the  probability  of  each  thematic  frame 
can  also  be  listed. 
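
A lexicon of this kind might be organized along the following lines. This is a minimal sketch under stated assumptions: the Frame class, the function-label strings, and both helper functions are hypothetical, and the probabilities are placeholders chosen for illustration, not values estimated from any corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    """One syntactic/thematic realization of a predicate."""
    pattern: tuple   # (thematic role, grammatical function) pairs
    prob: float      # placeholder value, NOT estimated from any corpus


# Hypothetical entries. donate lists no double-object frame, reflecting
# the observation that it resists the dative alternation.
LEXICON = {
    "give": [
        Frame((("AGENT", "subj"), ("THEME", "obj"), ("GOAL", "pp-to")), 0.5),
        Frame((("AGENT", "subj"), ("GOAL", "obj"), ("THEME", "obj2")), 0.5),
    ],
    "donate": [
        Frame((("AGENT", "subj"), ("THEME", "obj"), ("GOAL", "pp-to")), 1.0),
    ],
}


def allows_double_object(verb):
    """True if some listed frame realizes the GOAL as the direct object."""
    return any(("GOAL", "obj") in frame.pattern for frame in LEXICON[verb])


def most_likely_frame(verb):
    """Return the listed frame with the highest placeholder probability."""
    return max(LEXICON[verb], key=lambda frame: frame.prob)


print(allows_double_object("give"))    # True
print(allows_double_object("donate"))  # False
```

Because each verb carries its own list, an exception such as donate needs no special rule; it simply lacks the frame.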

One recent attempt to list these elements for a number of predicates of
English is the FRAMENET project (Baker et al., 1998; Lowe et al., 1997). A
FRAMENET entry for a word lists every set of arguments it can take,
including the possible sets of thematic roles, syntactic phrases, and their
grammatical function. The thematic roles used in FRAMENET are much more
specific than the 9 examples we’ve been describing. Each FRAMENET thematic
role is defined as part of a frame, and each frame as part of a domain. For
example, the Cognition domain has frames like static cognition (believe,
think, understand, etc.), cogitation (brood, ruminate), judgment (accuse,
admire, rebuke), etc. All of the cognition frames define the thematic role
COGNIZER. In the judgment frame, the COGNIZER is referred to as the JUDGE;
the frame also includes an EVALUEE, a REASON, and a ROLE; here are some
examples from Johnson (1998):

Judge    Kim respects Pat for being so brave
Evaluee  Kim respects Pat for being so brave
Reason   Kim respects Pat for being so brave
Role     Kim respects Pat as a scholar

Each entry is also labeled by one of the phrase types described in
Figure 11.3 on page 411, and by a grammatical function (subject, object,
or complement). For example, here is part of the FRAMENET entry for the
judgment verb appreciate; we have shown only the active senses of the verb;
the full entry includes passives as well. Example sentences are (sometimes
shortened) from the British National Corpus:

(16.35)  a.  Judge    Reason   Evaluee
             NP/Subj  NP/Obj   PP(in)/Comp
             I still appreciate good manners in men.

         b.  Judge    Evaluee  Reason
             NP/Subj  NP/Obj   PP(for)/Comp
             I could appreciate it for the music alone.

         c.  Judge    Reason
             NP/Subj  NP/Obj
             I appreciate your kindness

         d.  Judge    Evaluee  Role
             NP/Subj  NP/Obj   PP(as)/Comp
             He did not appreciate the artist as a dissenting voice.

By contrast, another sense of the verb appreciate is as a verb of static
cognition like understand; verbs of static cognition have roles like
COGNIZER and CONTENT; here are some examples:

(16.36)  a.  Cognizer  Content
             NP/Subj   Sfin/Comp
             They appreciate that communication is a two-way process.

         b.  Cognizer  Content
             NP/Subj   Swh-/Comp
             She appreciated how far she had fallen from grace.

It should be clear from examining the example sentences that some
generalizations can be drawn about the realization of different thematic
roles. JUDGES, COGNIZERS, and AGENTS in general are often realized as
subjects of active sentences. ROLES are often realized as PPs with the
preposition as. CONTENT is often realized as some kind of S. Representing
thematic roles at this fine-grained level may thus make the mapping to
syntax more transparent. The problem with a scheme like FRAMENET is the
extensive human effort it requires in defining thematic roles for each
domain and each frame.
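
Generalizations of this kind can be read off mechanically once the entries are stored as data. The sketch below is illustrative only: the triple format and the function name realizations are assumptions, not the actual FRAMENET file format, and the data is transcribed by hand from examples (16.35a-c) and (16.36a-b).

```python
from collections import defaultdict

# (role, phrase type, grammatical function) triples transcribed from
# examples (16.35a-c) and (16.36a-b) for the verb "appreciate".
ENTRIES = [
    [("Judge", "NP", "Subj"), ("Reason", "NP", "Obj"),
     ("Evaluee", "PP(in)", "Comp")],
    [("Judge", "NP", "Subj"), ("Evaluee", "NP", "Obj"),
     ("Reason", "PP(for)", "Comp")],
    [("Judge", "NP", "Subj"), ("Reason", "NP", "Obj")],
    [("Cognizer", "NP", "Subj"), ("Content", "Sfin", "Comp")],
    [("Cognizer", "NP", "Subj"), ("Content", "Swh-", "Comp")],
]


def realizations(entries):
    """Collect, for each role, the set of (phrase type, function) pairs seen."""
    seen = defaultdict(set)
    for entry in entries:
        for role, phrase, function in entry:
            seen[role].add((phrase, function))
    return dict(seen)


table = realizations(ENTRIES)
print(table["Judge"])            # {('NP', 'Subj')}
print(sorted(table["Content"]))  # [('Sfin', 'Comp'), ('Swh-', 'Comp')]
```

In this small sample the Judge and the Cognizer always surface as NP subjects, while Content always surfaces as some kind of S complement.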

Selection  Restrictions 

The notion of a selection restriction can be used to augment thematic roles
by allowing lexemes to place certain semantic restrictions on the lexemes
and phrases that can accompany them in a sentence. More specifically, a
selection restriction is a semantic constraint imposed by a lexeme on the
concepts that can fill the various argument roles associated with it. As
with many other kinds of linguistic constraints, selection restrictions can
most easily be observed in situations where they are violated. Consider the
following example originally discussed in Chapter 14.

(16.37)  I wanna  eat  someplace  that’s  close  to  ICSI. 

There are two possible parses for this sentence corresponding to the
intransitive and transitive versions of the verb eat. These two parses
lead, in turn, to two distinct semantic analyses. In the intransitive case,
the phrase someplace that’s close to ICSI is an adjunct that modifies the
event specified by the verb phrase, while in the transitive case it
provides a true argument to the eating event. This latter case is similar
in structure and interpretation to examples such as the following, where
the noun phrase specifies the thing to be eaten.

(16.38)  I wanna  eat  some  really  cheap  Chinese  food  right  now. 

Not surprisingly, attempting to analyze Example 16.37 along these
lines results in a kind of semantic ill-formedness. This ill-formedness
signals the presence of a selection restriction imposed by eat on its
PATIENT role: it has to be something that is edible. Since the phrase being
proposed as the PATIENT in this scenario cannot easily be interpreted as
edible, the interpretation exhibits the semantic analog of syntactic
ungrammaticality. This particular variety of ill-formedness arises from
what is known as a selection restriction violation: a situation where the
semantics of the filler of a thematic role is not consistent with a
constraint imposed on the role by the predicate.

This rather informal description of selection restrictions needs to be
refined in a number of ways before it can be put to practical use. The
first refinement concerns the proper locus for stating the selection
restrictions. As discussed in Section 16.1, lexemes are often associated
with a wide variety of different senses and, not surprisingly, these senses
can enforce differing constraints on their arguments. Selection
restrictions therefore are associated with particular senses, not entire
lexemes. Consider the following examples of the lexeme serve.

(16.39)  Well,  there  was  the  time  they  served  green-lipped  mussels  from 
New  Zealand. 

(16.40)  Which  airlines  serve  Denver? 

(16.41)  Which  ones  serve  breakfast? 

Example 16.39 illustrates the cooking sense of serve, which ordinarily
restricts its PATIENT to be some kind of foodstuff. Example 16.40
illustrates the ‘provides a commercial service to’ sense of serve, which
constrains its PATIENT to be some type of identifiable geographic or
political entity. The sense shown in the third example is closely related
to the first, and illustrates a sense of serve that is restricted to
specifications of particular meals. These differing restrictions on the
same thematic role of a polysemous lexeme can be accommodated by
associating them with distinct senses of the same lexeme. As we will
discuss in Chapter 17, this strongly suggests that selection restrictions
can be used to discriminate these senses in context.

Note  that  the  selection  restrictions  imposed  by  different  lexemes,  and 
different  senses  of  the  same  lexeme,  may  occur  at  widely  varying  levels 
of  specificity,  with  some  lexemes  expressing  very  general  conceptual  cat- 
egories, and  others  expressing  very  specific  ones  indeed.  Consider  the  fol- 
lowing examples  of  the  verbs  imagine,  lift  and  diagonalize. 

(16.42)  In  rehearsal,  I often  ask  the  musicians  to  imagine  a tennis  game. 

(16.43)  Others tell of jumping over beds and couches they can’t imagine
clearing while awake.

(16.44)  I cannot  even  imagine  what  this  lady  does  all  day. 

(16.45)  Atlantis lifted Galileo from the launch pad at 12:54 p.m. EDT and
released the craft from its cargo bay about six hours later.

(16.46)  When the battle was over, Mr. Kruger lifted the fish from the water,
gently removed the hook from its jaw, admired it, and eased it back
into the lake.

(16.47)  To diagonalize a matrix is to find its eigenvalues.

Given the meaning of imagine, it is not surprising to find that it places
few semantic restrictions on the concepts that can fill its PATIENT role.
Its AGENT role, on the other hand, is restricted to humans and other
animate entities. In contrast, the sense of lift shown in Examples 16.45
and 16.46 limits its PATIENT to be something liftable, which as these
examples illustrate is a notion that must cover both spacecraft and fish.
For all practical purposes, this notion is best captured by a fairly
general notion such as physical object. Finally, we have diagonalize, which
imposes a very specific constraint on the filler of its PATIENT role: it
has to be a matrix.

These examples serve to illustrate an important fact about selection
restrictions: the concepts, categories, and features that are deployed by
the lexicon as selection restrictions are not a part of the finite language
capacity. Rather, they are as open-ended as the lexicon itself. This
distinguishes selection restrictions from some of the other finite features
of language that are used to define lexemes including parts-of-speech,
thematic roles, and semantic primitives.

Before we move on, it is worth pointing out that verbs are not the
only part-of-speech that can impose selection restrictions on their
arguments. Rather, it appears to be the case that any predicate-bearing
lexeme can impose arbitrary semantic constraints on the concepts that fill
its argument roles. Consider the following examples, which illustrate the
selection restrictions associated with some non-verb parts-of-speech.

(16.48)  Radon  is  a naturally  occurring  odorless,  tasteless  gas  that  can’t  be 
detected  by  human  senses. 

(16.49)  What  is  the  lowest  fare  for  United  Airlines  flight  four  thirty? 

(16.50)  Are  there  any  restaurants  open  after  midnight? 

The adjectives odorless and tasteless in 16.48 are restricted to concepts
that can possess an odor or a taste. Similarly, as we discussed earlier in
Section 16.1, the noun fare is restricted to various forms of public
transportation. Finally, arguments to the preposition after must directly
or indirectly designate points in time.

Representing Selection Restrictions

The  semantics  of  selection  restrictions  can  be  captured  in  a straightforward 
way  by  extending  the  event-oriented  meaning  representations  employed  in 
Chapter  14.  Recall  that  the  representation  of  an  event  consists  of  a single 
variable  that  stands  for  the  event,  a predicate  that  denotes  the  kind  of  event, 
and  a series  of  variables  and  relations  that  designate  the  roles  associated 
with the event. Ignoring the issue of the λ-structures, and using thematic
roles rather than deep event roles, the semantic contribution of a verb like
eat might look like the following.

$$\exists e,x,y\; Eating(e) \land Agent(e,x) \land Patient(e,y)$$

With this representation, all we know about y, the filler of the Patient
role, is that it is associated with an Eating event via the Patient
relation. To stipulate the selection restriction that y must be something
edible, we simply add a new term to that effect, as in the following.

$$\exists e,x,y\; Eating(e) \land Agent(e,x) \land Patient(e,y) \land Isa(y,EdibleThing)$$

When a phrase like ate a hamburger is encountered, a semantic analyzer can
form the following kind of representation.

$$\exists e,x,y\; Eating(e) \land Agent(e,x) \land Patient(e,y) \land Isa(y,EdibleThing) \land Isa(y,Hamburger)$$

This  representation  is  perfectly  reasonable  since  the  membership  of  y in 
the  category  Hamburger  is  consistent  with  its  membership  in  the  category 
EdibleThing,  assuming  a reasonable  set  of  facts  in  the  knowledge  base.  Cor- 
respondingly, the  representation  for  a phrase  such  as  ate  a takeoff  would  be 
ill-formed  because  membership  in  an  event-like  category  such  as  Takeoff 
would  be  inconsistent  with  membership  in  the  category  EdibleThing. 
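
The consistency test described above can be mimicked with a very small category hierarchy. The sketch below stands in for a real logical knowledge base and inference engine: the SUPER and DISJOINT tables, and the assumption that physical objects and events are disjoint, are invented for illustration.

```python
# Toy knowledge base (assumed): immediate superclass of each category,
# plus pairs of categories declared disjoint.
SUPER = {
    "Hamburger": "EdibleThing",
    "EdibleThing": "PhysicalObject",
    "Takeoff": "Event",
}
DISJOINT = {frozenset({"PhysicalObject", "Event"})}


def ancestors(cat):
    """Return cat together with all of its superclasses."""
    out = {cat}
    while cat in SUPER:
        cat = SUPER[cat]
        out.add(cat)
    return out


def consistent(cat_a, cat_b):
    """False if any ancestor of cat_a is declared disjoint from one of cat_b."""
    return not any(
        frozenset({a, b}) in DISJOINT
        for a in ancestors(cat_a)
        for b in ancestors(cat_b)
    )


def well_formed(isa_terms):
    """Check that all Isa(y, C) terms for one variable are mutually consistent."""
    return all(consistent(a, b) for a in isa_terms for b in isa_terms)


print(well_formed(["EdibleThing", "Hamburger"]))  # True
print(well_formed(["EdibleThing", "Takeoff"]))    # False
```

Even this toy version depends on someone having recorded that hamburgers are edible and that events are not physical objects.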

While  this  approach  adequately  captures  the  semantics  of  selection  re- 
strictions, there  are  two  practical  problems  with  its  direct  use.  First,  using 
the  full  power  of  First  Order  Logic  to  perform  the  simple  task  of  enforcing 
selection  restrictions  is  overkill.  There  are  far  simpler  formalisms  that  can 
do  the  job  with  far  less  computational  cost.  The  second  problem  is  that  it 
presupposes  a large  logical  knowledge-base  of  facts  about  the  concepts  that 
make  up  selection  restrictions.  Unfortunately,  although  such  common  sense 
knowledge-bases  are  being  developed,  none  are  widely  available  and  few 
have  the  kind  of  scope  necessary  to  the  task. 

A far more practical approach, at least for English, is to exploit the
hyponymy relations present in the WordNet database. In this approach,
selection restrictions on semantic roles are stated in terms of WordNet
synsets, rather than logical concepts. A given meaning representation can
be judged to be well-formed if the lexeme that fills a thematic role has as
one of its hypernyms the synset specified by the predicate for that
thematic role. Consider how this approach would work with our ate a
hamburger example. Among its 60,000 synsets, WordNet includes the following
one, which is glossed as any substance that can be metabolized by an
organism to give energy and build tissue.

{food, nutrient}

Given this synset, we can specify it as the selection restriction on the
PATIENT role of the verb eat, thus limiting fillers of this role to lexemes
in this synset and its hyponyms. Luckily, the chain of hypernyms for
hamburger shown in Figure 16.11 reveals that hamburgers are indeed food.

Sense 1
hamburger, beefburger --
(a fried cake of minced beef served on a bun)
  => sandwich
    => snack food
      => dish
        => nutriment, nourishment, sustenance...
          => food, nutrient
            => substance, matter
              => object, physical object
                => entity, something

Figure 16.11  Evidence from WordNet that hamburgers are edible.

Note  that  in  this  approach,  the  filler  of  a role  does  not  have  to  match 
the  restriction  synset  exactly.  Rather,  a selection  restriction  is  satisfied  if  the 
filler  has  the  restricting  synset  as  one  of  its  eventual  hypernyms.  Thus  in  the 
hamburger  example,  the  selection  restriction  synset  is  found  five  hypernym 
levels  up  from  hamburger. 
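
The hypernym test just described amounts to walking up a chain of links. The following minimal sketch hard-codes the single chain from Figure 16.11 rather than querying WordNet; the HYPERNYM table and the function names are hypothetical, each synset is abbreviated to its first lemma, and a real system would also have to handle multiple senses and multiple hypernyms per synset.

```python
# Immediate hypernym of each synset, copied from the chain in Figure 16.11.
# Synsets are abbreviated to their first lemma (an assumption of this sketch).
HYPERNYM = {
    "hamburger": "sandwich",
    "sandwich": "snack food",
    "snack food": "dish",
    "dish": "nutriment",
    "nutriment": "food",
    "food": "substance",
    "substance": "object",
    "object": "entity",
}


def levels_up(filler, restriction):
    """Return how many hypernym links separate filler from restriction,
    or None if restriction is not on the filler's hypernym chain."""
    steps, node = 0, filler
    while node != restriction:
        if node not in HYPERNYM:
            return None
        node = HYPERNYM[node]
        steps += 1
    return steps


def satisfies(filler, restriction):
    """A restriction is met if it is the filler or one of its hypernyms."""
    return levels_up(filler, restriction) is not None


print(levels_up("hamburger", "food"))    # 5
print(satisfies("hamburger", "entity"))  # True
print(satisfies("hamburger", "matrix"))  # False
```

The filler never has to match the restricting synset exactly; any restriction that lies somewhere on the chain is satisfied.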

Of course, this approach also allows individual lexemes to satisfy
restrictions at varying levels of specificity. For example, consider what
happens when we apply this approach to the PATIENT roles of the verbs
imagine, lift and diagonalize, discussed earlier. Let us restrict imagine’s
PATIENT to the synset {entity, something}, lift’s PATIENT to {object,
physical object} and diagonalize’s to {matrix}. This arrangement correctly
permits imagine a hamburger and lift a hamburger, while also correctly
ruling out diagonalize a hamburger.

Note that this approach relies on the presence in WordNet of exactly
those lexemes that specify exactly the concepts needed for all possible
selection restrictions. Unfortunately, there is no particular reason to
believe that the set of concepts used as selection restrictions in a
language is exactly subsumed by the lexemes in the language. This situation
is accommodated to some extent in WordNet through the use of collocations
such as physical object and snack food.

To address this problem more directly, there are a number of
linguistically-oriented taxonomies that sit somewhere between common sense
knowledge-bases such as CYC, and lexical databases such as WordNet. The
objects contained in these hybrid models do not have to correspond to
individual lexical items, but rather to those concepts that are known to be
grammatically and lexically relevant. In most cases, the upper portions of
these taxonomies are taken to represent domain and language-independent
notions, such as physical objects, states, events and animacy. One of the
most well-developed of these ontologies is the PENMAN Upper Model,
discussed in more detail in Chapter 20.

Primitive  Decomposition 

The theories of meaning representation presented here, and in the last few
chapters, have had a decidedly lexical flavor. The meaning representations
for sentences have been composed of atomic symbols that appear to correspond
very closely to individual lexemes. However, other than thematic
roles, these lexical representations have not had much internal structure.
The notion of primitive decomposition, or componential analysis, is
an attempt to supply such a structure.

To  explore  these  notions,  consider  the  following  examples  motivated 
by  the  discussion  in  McCawley  (1968). 

(16.51) Jim killed his philodendron.

(16.52) Jim did something to cause his philodendron to become not alive.

One  can  make  an  argument  that  these  two  sentences  mean  the  same  thing. 
However, this is not a case of synonymy, since kill is not synonymous with any
individual  lexemes  in  16.52.  Instead,  one  can  think  of  kill  as  being  equivalent 
to  the  particular  configuration  of  more  fundamental  elements  found  in  the 
second  sentence. 

Taking this to the next logical step, we can invoke the notion of canonical
form and say that these two examples should have the same meaning
representation: the one underlying Example 16.52. Translating a simple
predicate like kill into a more complex set of predicates can be viewed as
breaking down, or decomposing, the meaning of words into combinations of
simpler, more primitive, parts. In this example, the more primitive, possibly
atomic, parts are the meaning representations associated with the lexemes
cause, become not, and alive.
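
As a rough illustration of decomposition into a canonical form, the following Python sketch expands kill into nested primitives, so that examples 16.51 and 16.52 receive the same representation. The tuple encoding, the DECOMPOSITIONS table and the canonical function are assumptions made for this sketch, not a notation used in the text.

```python
# Hypothetical lexicon: each complex predicate maps to a template built from
# the primitives CAUSE, BECOME, NOT and ALIVE.
DECOMPOSITIONS = {
    "kill": lambda agent, patient: (
        "CAUSE", agent, ("BECOME", ("NOT", ("ALIVE", patient)))
    ),
}


def canonical(predicate, agent, patient):
    """Expand a predicate into primitives if a decomposition is listed for it."""
    expand = DECOMPOSITIONS.get(predicate)
    if expand is None:
        # No decomposition known: treat the predicate itself as atomic.
        return (predicate.upper(), agent, patient)
    return expand(agent, patient)


# (16.51) Jim killed his philodendron.
short_form = canonical("kill", "Jim", "Philodendron")

# (16.52) Jim did something to cause his philodendron to become not alive.
long_form = ("CAUSE", "Jim", ("BECOME", ("NOT", ("ALIVE", "Philodendron"))))

print(short_form == long_form)
```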

While many such sets of primitives have been proposed, the approach
known as Conceptual Dependency (CD) (Schank, 1972) has been the most
widely used primitive-based representational system within natural language
processing. In this approach, eleven primitive predicates are used to represent
all predicate-like language expressions. Figure 16.12 shows the eleven
primitives with a brief explanation of their meaning.

As  an  example  of  this  approach,  consider  the  following  sentence  along 
with  its  CD  representation. 

(16.53)  The  waiter  brought  Mary  the  check. 

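$$\exists x, y\; \mathit{Atrans}(x) \land \mathit{Actor}(x, \mathit{Waiter}) \land \mathit{Object}(x, \mathit{Check}) \land \mathit{To}(x, \mathit{Mary})$$
$$\land\; \mathit{Ptrans}(y) \land \mathit{Actor}(y, \mathit{Waiter}) \land \mathit{Object}(y, \mathit{Check}) \land \mathit{To}(y, \mathit{Mary})$$

Here, the verb brought is translated into the two primitives ATRANS and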
PTRANS  to  indicate  the  fact  that  the  waiter  both  physically  conveyed  the 
check to Mary and passed control of it to her. Note that CD also associates
a fixed  set  of  thematic  roles  with  each  primitive  to  represent  the  various 
participants  in  the  action. 
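
One way to hold such a representation in a program is as a list of events, one per primitive, each carrying the same fixed set of roles. The sketch below is an assumption-laden illustration: the CDEvent class, its field names and decompose_brought are hypothetical names, not part of Conceptual Dependency itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CDEvent:
    """One primitive act with the thematic roles used in the example."""
    primitive: str
    actor: str
    obj: str
    to: str


def decompose_brought(actor, obj, recipient):
    """Translate 'X brought Y the Z' into an ATRANS and a PTRANS event."""
    return [
        CDEvent("ATRANS", actor, obj, recipient),  # control of the object changes hands
        CDEvent("PTRANS", actor, obj, recipient),  # the object physically moves
    ]


events = decompose_brought("Waiter", "Check", "Mary")
for event in events:
    print(event)
```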

Note  that,  in  general,  the  compositional  approach  need  not  be  limited  to 
the  meanings  of  verbs.  The  same  notion  can  be  used  to  decompose  nominals 
into  more  primitive  notions.  Consider  the  following  decompositions  of  the 
lexemes  kitten,  puppy,  and  child  into  more  primitive  elements. 

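$\exists x\; \mathit{Isa}(x, \mathit{Feline}) \land \mathit{Isa}(x, \mathit{Youth})$

$\exists x\; \mathit{Isa}(x, \mathit{Canine}) \land \mathit{Isa}(x, \mathit{Youth})$

$\exists x\; \mathit{Isa}(x, \mathit{Human}) \land \mathit{Isa}(x, \mathit{Youth})$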

Here  the  primitives  represent  more  primitive  categories  of  objects,  rather 
than  actions.  Using  these  primitives,  the  close  relationship  between  these 
lexemes  and  the  related  terms  cat,  dog  and  person  can  then  be  captured  with 
the  following  similar  formulas. 

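$\exists x\; \mathit{Isa}(x, \mathit{Feline}) \land \mathit{Isa}(x, \mathit{Adult})$

$\exists x\; \mathit{Isa}(x, \mathit{Canine}) \land \mathit{Isa}(x, \mathit{Adult})$

$\exists x\; \mathit{Isa}(x, \mathit{Human}) \land \mathit{Isa}(x, \mathit{Adult})$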
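
The relationships these formulas capture can be made explicit by treating each nominal as a set of primitive categories and intersecting the sets. This is a minimal sketch; the NOMINALS table and shared_primitives function are hypothetical names introduced for illustration.

```python
# Each lexeme is decomposed into the primitive categories used in the formulas.
NOMINALS = {
    "kitten": {"Feline", "Youth"},
    "puppy": {"Canine", "Youth"},
    "child": {"Human", "Youth"},
    "cat": {"Feline", "Adult"},
    "dog": {"Canine", "Adult"},
    "person": {"Human", "Adult"},
}


def shared_primitives(a, b):
    """Return the primitive categories that two lexemes have in common."""
    return NOMINALS[a] & NOMINALS[b]


print(shared_primitives("kitten", "cat"))    # the species is shared
print(shared_primitives("kitten", "puppy"))  # the age category is shared
print(shared_primitives("kitten", "dog"))    # nothing is shared
```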

Primitive  Definition
---------  ----------
ATRANS     The abstract transfer of possession or control from one entity
           to another.
PTRANS     The physical transfer of an object from one location to another.
MTRANS     The transfer of mental concepts between entities or within an
           entity.
MBUILD     The creation of new information within an entity.
PROPEL     The application of physical force to move an object.
MOVE       The integral movement of a body part by an animal.
INGEST     The taking in of a substance by an animal.
EXPEL      The expulsion of something from an animal.
SPEAK      The action of producing a sound.
ATTEND     The action of focusing a sense organ.

Figure 16.12  A set of conceptual dependency primitives.

The primary applications of primitives in natural language processing
have been in semantic analysis and in machine translation. In semantic
analysis, the principal use has been in organizing the inference process. Instead
of having to encode thousands of idiosyncratic meaning postulates with
particular lexical items, inference rules can be associated with a small number
of primitives. We should note that the use of primitive decomposition in the
representation of nominals has largely been supplanted by the use of inheritance
hierarchies. As we will see in Chapter 21, the emphasis in machine
translation has been on the use of primitives as language-independent meaning
representations, or interlinguas.
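
The economy gained by attaching inference rules to primitives rather than to individual verbs can be sketched as follows. The verb-to-primitive lexicon and the two rules are invented for illustration and are not Schank's actual rules; VERB_PRIMITIVES, INFERENCE_RULES and infer are hypothetical names.

```python
# Hypothetical verb-to-primitive lexicon; many verbs share a few primitives.
VERB_PRIMITIVES = {
    "give": ["ATRANS"],
    "sell": ["ATRANS"],
    "carry": ["PTRANS"],
    "bring": ["ATRANS", "PTRANS"],
}

# One inference rule per primitive (illustrative only).
INFERENCE_RULES = {
    "ATRANS": lambda actor, obj, to: f"{to} controls {obj}",
    "PTRANS": lambda actor, obj, to: f"{obj} is located at {to}",
}


def infer(verb, actor, obj, to):
    """Collect the conclusions licensed by each primitive underlying the verb."""
    return [INFERENCE_RULES[p](actor, obj, to) for p in VERB_PRIMITIVES[verb]]


print(infer("bring", "Waiter", "Check", "Mary"))
print(infer("give", "Waiter", "Check", "Mary"))
```

Adding a new verb in this scheme requires only a new lexicon entry; the two rules are written once and reused by every verb that decomposes into the same primitives.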

Semantic  Fields 

The lexical relations described in Section 16.1 had a decidedly local
character, and made no use of the internal structure of the lexemes taking part
in the relation. The notion of a semantic field is an attempt to capture a
more integrated, or holistic, relationship among entire sets of words from a
single domain. Consider the following set of words extracted from the ATIS
corpus.

reservation,  flight,  travel,  buy,  price,  cost,  fare,  rates,  meal,  plane 

It is certainly possible to assert individual lexical relations between
many of the lexemes in this list. The resulting set of relations does not,
however, add up to a complete account of how these lexemes are related. They
are clearly all defined with respect to a coherent chunk of common sense
background information concerning air travel. Background knowledge of
this kind has been studied under a variety of frameworks and is known
variously as a frame (Fillmore, 1985), model (Johnson-Laird, 1983), or script
(Schank and Abelson, 1977), and plays a central role in a number of
computational frameworks, some of which will be discussed in Chapter 18.

The FrameNet project (Baker et al., 1998) is a recent attempt to provide
a robust resource for this kind of knowledge. In FrameNet, lexemes that
refer to actions, events, thematic roles, and objects belonging to a particular
domain are linked to concepts contained in frames that represent that
particular domain. As in most current ontology efforts, these frames are arranged
in a hierarchy so that specific frames can inherit roles from more abstract
frames. The current FrameNet effort is directed at the creation of several
thousand frame-semantic lexical entries. The domains to be covered
include: HEALTH CARE, CHANCE, PERCEPTION, COMMUNICATION, TRANSACTION,
TIME, SPACE, BODY, MOTION, LIFE STAGES, SOCIAL CONTEXT,
and COGNITION.
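
The idea of specific frames inheriting roles from more abstract ones can be illustrated with a small Python sketch. The frame names and role names here are hypothetical, chosen only to echo the air travel example; they are not taken from FrameNet's actual inventory, and the Frame class is an assumption of this sketch.

```python
class Frame:
    """A frame with locally declared roles and an optional more abstract parent."""

    def __init__(self, name, roles, parent=None):
        self.name = name
        self.local_roles = list(roles)
        self.parent = parent

    def roles(self):
        """Return inherited roles first, then the roles declared on this frame."""
        inherited = self.parent.roles() if self.parent else []
        return inherited + [r for r in self.local_roles if r not in inherited]


# Hypothetical frames and role names, chosen only to illustrate inheritance.
transaction = Frame("TRANSACTION", ["Buyer", "Seller", "Goods", "Money"])
ticket_purchase = Frame("TICKET-PURCHASE", ["Flight", "Fare"], parent=transaction)

print(ticket_purchase.roles())
```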

16.4  Creativity  and  the  Lexicon 

The approach we have presented thus far views the lexicon as a static
repository from which meaning representations are retrieved as needed. A more
realistic alternative view holds that the lexicon is closer to a generative
device than a static repository. Rather than simply retrieving static senses, the
lexicon generates meaning components appropriate to each situation on
demand. Under this view, much of the apparent polysemy in the lexicon is due
to this generative capacity. This capacity is, of course, not unlimited or
unsystematic. Rather, it is governed by a number of productive, or generative,
models that can systematically combine lexical, grammatical, contextual,
and common sense knowledge to create the novel meanings we see every
day.

To make this discussion more concrete, consider the following sentence
from the WSJ corpus.

(16.54) That doesn’t scare Digital, which has grown to be the world’s
second-largest computer maker by poaching customers of IBM’s
mid-range machines.

Let’s consider the meanings of scare and poach in this example. The verb
scare in WordNet has two closely related senses: to cause fear in, and to
cause to lose courage. Although it might be interesting to consider which
of these senses is the right one for this example, it’s even more interesting to
consider what it would mean for a corporation to lose courage, or even to
have it in the first place. For this sentence to make sense, it would appear to
be the case that corporations must be able to experience emotions like fear
or courage. Of course, they don’t, but we certainly speak of them and often
reason about them as if they do.

The verb poach in WordNet has a cooking by boiling sense, and an
illegal taking of game sense. Intuitively, the use of poach in this example is
closer  to  the  illegal  taking  meaning  than  the  boiling  one.  Of  course,  this  is 
clearly  not  a simple  instance  of  this  use;  the  poaching  involved  is  not  illegal, 
and  we  can  only  hope  that  the  poached  things  are  not  being  killed.  In  this 
case,  the  customers  are  being  viewed  as  a kind  of  property  belonging  to  the 
company  they  do  business  with;  and  when  they  choose  to  do  business  with 
another  company  they  have  been  stolen. 

This ability to talk about, and reason about, concepts in terms of other
distinct kinds of concepts is called metaphor and is pervasive in all
languages. As a generative model, it is responsible for a large proportion of
the polysemy in the language, including many of the senses that are listed in
dictionaries as well as the more novel ones that are not.

Let’s  now  consider  the  following  example  from  the  WSJ. 

(16.55) GM killed the Fiero because it had dedicated a full-scale factory to...

The use of kill in this example roughly means to put an end to some kind
of ongoing effort or activity: in this case, the ongoing activity of building,
marketing, and selling a particular kind of car. The metaphor underlying this
use views activities as living things, allowing the termination to be viewed
as a killing. Note, however, that this sentence does not say any of this. In
particular, the PATIENT of the killing is a definite reference, the Fiero. For
the metaphor to make sense, this phrase must refer not to a particular car, but
rather to an entire sales and production effort at GM. At a very high level,
this is a case where the result of an entire effort, or process, is being used
to refer to the process itself. This is an example of metonymy, referring
to a concept by mentioning a concept closely related to it. Like metaphor,
metonymy is pervasive and goes mostly unnoticed in natural settings.


16.5  Summary

This  chapter  has  covered  a wide  range  of  issues  concerning  the  meanings 
associated with lexical items. The following are among the highlights:

• Lexical semantics is the study of the systematic meaning-related
connections among lexemes, and the internal meaning-related structure of
individual lexemes.

• Homonymy refers to lexemes with the same form but unrelated
meanings.

• Polysemy refers to the notion of a single lexeme with multiple related
meanings.

• Synonymy holds between different lexemes with the same meaning.

• Hyponymy relations hold between lexemes that are in a class-inclusion
relationship.

• Semantic fields are used to capture semantic connections among groups
of lexemes drawn from a single domain.

• WordNet is a large database of lexical relations for English words.

• Thematic roles abstract away from the specifics of deep semantic roles
by generalizing over similar roles across classes of verbs.

• Semantic selection restrictions allow lexemes to post constraints on the
semantic properties of the constituents that accompany them in
sentences.

• Primitive decomposition permits the representation of the
meanings of individual lexemes in terms of finite sets of sub-lexical
primitives.

• Generative devices such as metaphor and metonymy are pervasive, and
produce novel meanings that cannot in principle be captured in a static
lexicon.


Bibliographical  and  Historical  Notes 

Lyons (1977) and Cruse (1986) are classic linguistics texts on lexical
semantics. Collections describing computational work on lexical semantics can be
found in (Pustejovsky and Bergler, 1992; Saint-Dizier and Viegas, 1995;
Klavans, 1995).

Martin (1986) and Copestake and Briscoe (1995) discuss computational
approaches to the representation of polysemy. The most comprehensive
collection of work concerning WordNet can be found in (Fellbaum,
1998). There have been many efforts to use existing dictionaries as lexical
resources. One of the earliest was Amsler’s (1980, 1981) use of the
Merriam Webster dictionary. More recently, the machine readable version of
Longman’s Dictionary of Contemporary English has been used in a number of
systems (Boguraev and Briscoe, 1989).

Thematic roles, or case roles, can be traced back to work by Fillmore
(1968) and Gruber (1965b). Fillmore’s work had an enormous and immediate
impact on work in natural language processing. For a considerable
period of time, nearly all work in natural language understanding used some
version of Fillmore’s case roles. Much of the early work in this vein was due
to Simmons (1973b, 1978, 1983).

Work  on  selection  restrictions  as  a way  of  characterizing  semantic 
well-formedness  began  with  (Katz  and  Fodor,  1963).  McCawley  (1968)  was 
the  first  to  point  out  that  selection  restrictions  could  not  be  restricted  to  a 
finite  list  of  semantic  features,  but  had  to  be  drawn  from  a larger  base  of 
unrestricted world knowledge.

Lehrer  (1974)  is  a classic  text  on  semantic  fields.  More  recent  papers 
addressing  this  topic  can  be  found  in  (Lehrer  and  Kittay,  1992).  Baker  et  al. 
(1998)  describe  ongoing  work  on  the  FrameNet  project. 

The use of primitives, components, and features to define lexical items
is ancient. Nida (1975) presents a comprehensive overview of work on
componential analysis. Wierzbicka (1996) has long been a major
advocate of the use of primitives in linguistic semantics. Another prominent
effort has been Jackendoff’s Conceptual Semantics (Jackendoff, 1983a,
1990), which combines thematic roles and primitive decomposition. On
the computational side, Schank’s Conceptual Dependency (Schank, 1972)
remains the most widely used set of primitives in natural language processing.
Wilks (1975a) was an early promoter of the use of primitives in machine
translation, as well as in natural language understanding in general. More
recently, Dorr (1993, 1992) has made considerable computational use of
Jackendoff’s framework in her work on machine translation.

An influential collection of papers on metaphor can be found in (Ortony,
1993). Lakoff and Johnson (1980) is the classic work on conceptual metaphor
and metonymy. Pustejovsky (1995) introduced the notion of the Generative
Lexicon, a conceptual framework that rejects the notion of the lexicon
as a static repository in favor of a more dynamic view. Russell (1976)
presents one of the earliest computational approaches to metaphor. Additional
early work can be found in (DeJong and Waltz, 1983; Wilks, 1978; Hobbs,
1979b). More recent computational efforts to analyze metaphor can be found
in (Fass, 1988, 1991; Martin, 1990; Veale and Keane, 1992; Iverson and
Helmreich, 1992; Chandler, 1991). Martin (1996) presents a survey of
computational approaches to metaphor and other types of figurative language.


Exercises 

16.1  Collect  three  definitions  of  ordinary  non-technical  English  words  from 
a dictionary of your choice that you feel are flawed in some way. Explain
the  nature  of  the  flaw  and  how  it  might  be  remedied. 

16.2  Download  and  install  the  current  version  of  WordNet. 

16.3  Give a detailed account of similarities and differences among the
following set of lexemes: imitation, synthetic, artificial, fake and simulated.
Examine the entries for these lexemes in WordNet (or some dictionary
of your choice). How well does it reflect your analysis?

16.4  Consider  the  following  examples  from  (McCawley,  1968). 

My  neighbor  is  a father  of  three. 

?My  buxom  neighbor  is  a father  of  three. 

What  does  the  ill-formedness  of  the  second  example  imply  about  how 
constituents  satisfy,  or  violate,  selection  restrictions? 

16.5  Find  some  articles  about  business,  sports,  or  politics  from  your  daily 
newspaper.  Identify  as  many  lexical  metaphors  and  metonymies  as  you  can 
in  these  articles.  How  many  of  these  uses  have  reasonably  close  entries  in 
either  WordNet  or  your  favorite  dictionary? 


16.6  [more to come]


17  WORD SENSE DISAMBIGUATION AND INFORMATION RETRIEVAL


Oh  are  you  from  Wales? 

Do  you  know  a fella  named  Jonah? 
He  used  to  live  in  whales  for  a while. 
Groucho  Marx 


This chapter introduces a number of topics related to lexical semantic
processing. By this, we have in mind applications that make use of word
meanings, but which are to varying degrees decoupled from the more complex
tasks of compositional sentence analysis and discourse understanding.

The first topic we cover, word sense disambiguation, is of considerable
theoretical and practical interest. As we noted in Chapter 16, the task of
word sense disambiguation is to examine word tokens in context and specify
which sense of each word is being used. As we will see in the next two
sections, making this vague definition operational is non-trivial: there is
no clear consensus as to exactly what the task is, or how it should be
evaluated. Nevertheless, there are robust algorithms that can achieve high levels
of accuracy under certain reasonable assumptions.

The second topic we cover, information retrieval, is an extremely
broad field, encompassing a wide range of topics pertaining to the storage,
analysis, and retrieval of all manner of media (Baeza-Yates and
Ribeiro-Neto, 1999). Our concern in this chapter is solely with the storage and
retrieval of text documents in response to users’ requests for information. We
are interested in approaches in which users’ needs are expressed as words,
and documents are represented in terms of the words they contain. Section
17.3 presents the vector space model, a well-established approach used in
most current systems, including most Web search engines.


17.1  Selection Restriction-Based Disambiguation

For the most part, our discussions of compositional semantic analyzers in
Chapter 15 ignored the issue of lexical ambiguity. By now it should be clear
that this is not a reasonable approach. Without some means of selecting
correct senses for the words in the input, the enormous amount of homonymy
and polysemy in the lexicon will quickly overwhelm any approach in an
avalanche of competing interpretations. As with syntactic part-of-speech
tagging, there are two fundamental approaches to handling this ambiguity
problem. In the first approach, the selection of correct senses occurs during
semantic analysis as a side-effect of the elimination of ill-formed
representations composed from an incorrect combination of senses. In the second
approach, sense disambiguation is performed as a stand-alone task
independent of, and prior to, compositional semantic analysis. This section discusses
the role of selection restrictions in the former approach. The stand-alone
approach is discussed in detail in Section 17.2.

Selection restrictions and type hierarchies are the primary knowledge
sources used to perform disambiguation in most integrated approaches. In
particular, they are used to rule out inappropriate senses and thereby reduce
the amount of ambiguity present during semantic analysis. If we assume
an integrated rule-to-rule approach to semantic analysis, then selection
restrictions can be used to block the formation of component meaning
representations that contain violations. By blocking such ill-formed components,
the semantic analyzer will find itself dealing with fewer ambiguous meaning
representations. This ability to focus on correct senses by eliminating flawed
representations that result from incorrect senses can be viewed as a form of
indirect word sense disambiguation. While the linguistic basis for this
approach can be traced back to the work of Katz and Fodor (1963), the most
sophisticated computational exploration of it is due to Hirst (1987).

As  an  example  of  this  approach,  consider  the  following  pair  of  WSJ 
examples,  focusing  solely  on  their  use  of  the  lexeme  dish. 

(17.1)  “In  our  house,  everybody  has  a career  and  none  of  them  includes 
washing  dishes”,  he  says. 

(17.2)  In  her  tiny  kitchen  at  home,  Ms.  Chen  works  efficiently,  stir-frying 
several  simple  dishes,  including  braised  pig’s  ears  and  chicken  livers 
with  green  peppers. 

These  examples  make  use  of  two  polysemous  senses  of  the  lexeme  dish.  The 
first refers to the physical objects that we eat from, while the second refers to
the actual meals or recipes. The fact that we perceive no ambiguity in these
examples  can  be  attributed  to  the  selection  restrictions  imposed  by  wash  and 
stir-fry  on  their  PATIENT  roles,  along  with  the  semantic  type  information 
associated  with  the  two  senses  of  dish.  More  specifically,  the  restrictions 
imposed  by  wash  conflict  with  the  food  sense  of  dish  since  it  does  not  denote 
something  that  is  normally  washable.  Similarly,  the  restrictions  on  stir-fry 
conflict  with  the  artifact  sense  of  dish,  since  it  does  not  denote  something 
edible.  Therefore,  in  both  of  these  cases  the  predicate  selects  the  correct 
sense  of  an  ambiguous  argument  by  eliminating  the  sense  that  fails  to  match 
one  of  its  selection  restrictions. 
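
The way wash and stir-fry each pick out one sense of dish can be sketched as a simple filter. The sense labels, the type sets and the names SENSE_TYPES, PATIENT_REQUIRES and select_senses are all assumptions made for this illustration; a real system would draw the types from a hierarchy such as WordNet's.

```python
# Hypothetical semantic types for two senses of "dish".
SENSE_TYPES = {
    "dish/artifact": {"physical object", "washable"},
    "dish/food": {"physical object", "edible"},
}

# Hypothetical PATIENT restrictions for the two predicates.
PATIENT_REQUIRES = {"wash": "washable", "stir-fry": "edible"}


def select_senses(verb, senses):
    """Keep only the argument senses that meet the verb's PATIENT restriction."""
    required = PATIENT_REQUIRES[verb]
    return [sense for sense in senses if required in SENSE_TYPES[sense]]


dish_senses = ["dish/artifact", "dish/food"]
print(select_senses("wash", dish_senses))
print(select_senses("stir-fry", dish_senses))
```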

Now  consider  the  following  WSJ  and  ATIS  examples,  focusing  on  the 
ambiguous  predicate  serve. 

(17.3)  Well,  there  was  the  time  they  served  green-lipped  mussels  from  New 
Zealand. 

(17.4)  Which  airlines  serve  Denver? 

(17.5)  Which  ones  serve  breakfast? 

Here  the  sense  of  serve  in  17.3  requires  some  kind  of  food  as  its  PATIENT, 
the  sense  in  17.4  requires  some  kind  of  geographical  or  political  entity,  and 
the  sense  in  the  last  example  requires  a meal  designator.  If  we  assume  that 
mussels, Denver and breakfast are unambiguous, then it is the arguments
in  these  examples  that  select  the  appropriate  sense  of  the  verb. 

Of course, there are also cases where both the predicate and the
argument have multiple senses. Consider the following BERP example.

(17.6)  I’m  looking  for  a restaurant  that  serves  vegetarian  dishes. 

Restricting  ourselves  to  three  senses  of  serve  and  two  senses  of  dish  yields 
six  possible  sense  combinations  in  this  example.  However,  since  only  one 
combination  of  the  six  is  free  from  a selection  restriction  violation,  determin- 
ing the  correct  sense  of  both  serve  and  dish  is  straightforward.  In  particular, 
the  predicate  and  argument  mutually  select  the  correct  senses. 
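
Mutual selection amounts to enumerating every pairing of predicate and argument senses and discarding the pairings that violate a restriction. The following sketch assumes invented sense labels and type assignments for three senses of serve and two of dish; SERVE_SENSES, DISH_SENSES and consistent_pairs are hypothetical names.

```python
from itertools import product

# Hypothetical type sets for each sense of the argument "dish".
DISH_SENSES = {
    "dish/artifact": {"artifact"},
    "dish/food": {"food"},
}

# Hypothetical PATIENT restriction for each of three senses of "serve".
SERVE_SENSES = {
    "serve/provide-food": "food",
    "serve/fly-to": "location",
    "serve/provide-meal": "meal",
}

# All pairings of a verb sense with a noun sense.
all_pairs = list(product(SERVE_SENSES, DISH_SENSES))


def consistent_pairs():
    """Return the sense pairings that are free of selection restriction violations."""
    return [
        (verb_sense, noun_sense)
        for verb_sense, noun_sense in all_pairs
        if SERVE_SENSES[verb_sense] in DISH_SENSES[noun_sense]
    ]


print(len(all_pairs), "combinations,", len(consistent_pairs()), "consistent")
print(consistent_pairs())
```

When exactly one pairing survives, as here, both words are disambiguated at once; when none or several survive, other methods are needed.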

Before  moving  on,  we  should  note  there  will  always  be  examples  like 
the  following  where  the  available  selection  restrictions  are  too  general  to 
uniquely  select  a correct  sense. 

(17.7)  What  kind  of  dishes  do  you  recommend? 

In  cases  like  this  we  either  have  to  rely  on  the  stand-alone  methods  discussed 
in  17.2,  or  knowledge  of  the  broader  discourse  context,  as  will  be  discussed 
in  Chapter  18. 




Although there are a wide variety of ways to integrate this style of
disambiguation into a semantic analyzer, the most straightforward approach
follows the rule-to-rule strategy introduced in Chapter 15. In this integrated
approach, fragments of meaning representations are composed and checked
for selection restriction violations as soon as their corresponding syntactic
constituents are created. Those representations that contain selection
restriction violations are eliminated from further consideration.

This approach requires two additions to the knowledge structures used
in our semantic analyzers: access to hierarchical type information about the
arguments, and semantic selection restriction information about the
arguments to predicates. Recall from Chapter 16 that both of these can be
encoded using knowledge from WordNet. The first is available in the form of
the hypernym information about the heads of the meaning structures being
used as arguments to predicates. Similarly, selection restriction information
about argument roles can be encoded by associating the appropriate WordNet
synsets with the arguments to each predicate-bearing lexical item. Exercise
?? asks you to explore this approach in more detail.

Limitations  of  Selection  Restrictions 

Not surprisingly, there are a number of practical and theoretical problems
with this use of selection restrictions. The first symptom of these problems
is the fact that there are many perfectly well-formed, interpretable sentences
that contain obvious violations of selection restrictions. Therefore, any
approach based on a strict elimination of such interpretations is in serious
trouble.

Consider  the  following  WSJ  example. 

(17.8) But it fell apart in 1931, perhaps because people realized you can’t
eat gold for lunch if you’re hungry.

The  phrase  eat  gold  clearly  violates  the  selection  restriction  that  eat  places 
on  its  PATIENT  role.  Nevertheless,  this  example  is  perfectly  well-formed. 
The key is the negative environment set up by can’t prior to the violation of
the  restriction.  This  example  makes  it  clear  that  any  purely  local,  or  rule-to- 
rule,  analysis  of  selection  restrictions  will  fail  when  a wider  context  makes 
the  violation  of  a selection  restriction  acceptable,  as  in  this  case. 

A second problem with selection restrictions is illustrated by the
following example.

(17.9) In his two championship trials, Mr. Kulkarni ate glass on an empty
stomach, accompanied only by water and tea.

Although the event described in this example is somewhat unusual, the
sentence itself is not semantically ill-formed, despite the violation of eat’s
selection restriction. Examples such as this illustrate the fact that thematic roles
and selection restrictions are merely loose approximations of the deeper
concepts they represent. They cannot hope to account for uses such as this that
require deeper commonsense knowledge about what eating is all about. At
best, they reflect the idea that the things that are eaten are normally edible.

Finally,  as  discussed  in  Chapter  16,  metaphoric  and  metonymic  uses 
challenge  this  approach  as  well.  Consider  the  following  WSJ  example. 

(17.10) If you want to kill the Soviet Union, get it to try to eat Afghanistan.

Here  the  typical  selection  restrictions  on  the  PATIENTS  of  both  kill  and  eat 
will  eliminate  all  possible  literal  senses  leaving  the  system  with  no  possible 
meanings.  In  many  systems,  such  a situation  serves  to  trigger  alternative 
mechanisms  for  interpreting  metaphor  and  metonymy  (Fass,  1997). 

As Hirst (1987) observes, examples like these often result in the
elimination of all senses, bringing semantic analysis to a halt. One approach to
alleviating this problem is to adopt the view of selection restrictions as
preferences, rather than rigid requirements. Although there have been many
instantiations of this approach over the years (Wilks, 1975c, 1975b, 1978),
the one that has received the most thorough empirical evaluation is Resnik’s
(1998) work, which uses the notion of a selectional association introduced
on page ??. Recall that this notion uses an empirically derived measure of
the strength of association between a predicate and a class dominating the
argument to the predicate.

A simplified version of Resnik’s disambiguation algorithm is shown in
Figure 17.1. The basic notion behind this algorithm is to select as the
correct sense for the argument the one that has the highest selectional
association between one of its ancestor hypernyms and the predicate. Resnik
(1998) reports an average of 44% correct with this technique for verb-object
relationships, a result that is an improvement over a most frequent sense
baseline. A limitation of this approach is that it only addresses the case
where the predicate is unambiguous and selects the correct sense of the
argument. A more complex decision criterion would be needed for the more
likely situation where both the predicate and argument are ambiguous.
function SA-WSD(pred, arg) returns sense

  best-association ← Minimum possible selectional association
  for each sense in senses of arg do
    for each hypernym in hypernyms of sense do
      new ← Selectional association between hypernym and pred
      if new > best-association then
        best-association ← new
        best-sense ← sense
    end
  end
  return best-sense


Figure 17.1 Resnik’s (1998) selectional association-based word sense
disambiguation algorithm. The selectional association between each hypernym
of each sense of the target argument and the predicate is computed. The
sense with the most closely associated hypernym is selected.
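
The control flow of Figure 17.1 can be sketched in Python. This is only an
illustration of the loop structure, not Resnik's implementation: the
HYPERNYMS mapping and the ASSOCIATION table are hypothetical toy data
standing in for a WordNet lookup and for corpus-derived selectional
association scores, and the sense labels are invented for the example.

```python
import math

# Hypothetical toy data: each sense of an argument maps to its ancestor
# hypernym classes (a stand-in for a WordNet lookup).
HYPERNYMS = {
    "bass/fish": ["fish", "food", "animal"],
    "bass/instrument": ["musical_instrument", "artifact"],
}

# Hypothetical selectional association scores for (predicate, class) pairs.
ASSOCIATION = {
    ("eat", "food"): 2.1,
    ("eat", "animal"): 0.4,
    ("eat", "artifact"): -0.3,
}


def sa_wsd(pred, senses, hypernyms=HYPERNYMS, association=ASSOCIATION):
    """Return the sense whose hypernym is most strongly associated with pred."""
    best_association = -math.inf  # minimum possible selectional association
    best_sense = None
    for sense in senses:
        for hypernym in hypernyms[sense]:
            # Pairs missing from the table are treated as minimally associated.
            new = association.get((pred, hypernym), -math.inf)
            if new > best_association:
                best_association = new
                best_sense = sense
    return best_sense
```

Note that, as in the figure, the predicate is taken to be unambiguous; when
no hypernym of any sense has a known association with the predicate, the
sketch returns None rather than guessing.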


17.2  Robust  Word  Sense  Disambiguation 


The selection restriction approach to disambiguation has too many
requirements to be useful in large-scale practical applications. Even with
the use of WordNet, the requirements of complete selection restriction
information for all predicate roles, and complete type information for the
senses of all possible fillers are unlikely to be met. In addition, as we
saw in Chapters 10, 12, and 15, the availability of a complete and accurate
parse for all inputs is unlikely to be met in environments involving
unrestricted text.

To address these concerns, a number of robust disambiguation systems
with more modest requirements have been developed over the years. As
with part-of-speech taggers, these systems are designed to operate in a
stand-alone fashion and make minimal assumptions about what information will
be available from other processes.

Machine  Learning  Approaches 

In machine learning approaches, systems are trained to perform the task
of word sense disambiguation. In these approaches, what is learned is a
classifier that can be used to assign as yet unseen examples to one of a
fixed number of senses. As we will see, these approaches vary as to the
nature of the training material, how much material is needed, the degree of
human intervention, the kind of linguistic knowledge used, and the output
produced. What they all share is an emphasis on acquiring the knowledge
needed for the task from data, rather than from human analysts. The
principal question to keep in mind as we explore these systems is whether
the method scales; that is, would it be possible to apply the method to a
substantial part of the entire vocabulary of a language?

The  Inputs:  Feature  Vectors 

Before  discussing  the  algorithms,  we  should  first  characterize  the  kind  of 
inputs  they  expect.  In  most  of  these  approaches,  the  initial  input  consists  of 
the  word  to  be  disambiguated,  which  we  will  refer  to  as  the  target  word, 
along  with  a portion  of  the  text  in  which  it  is  embedded,  which  we  will  call 
its  context.  This  initial  input  is  then  processed  in  the  following  ways: 

• The input is normally part-of-speech tagged using one of the
high-accuracy methods described in Chapter 8.

• The  original  context  may  be  replaced  with  larger  or  smaller  segments 
surrounding  the  target  word. 

• Often  some  amount  of  stemming,  or  more  sophisticated  morphological 
processing,  is  performed. 

• Less  often,  some  form  of  partial  parsing,  or  dependency  parsing,  is 
performed  to  ascertain  thematic  or  grammatical  roles  and  relations. 

After this initial processing, the input is then boiled down to a fixed set
of features that capture information relevant to the learning task. This
task consists of two steps: selecting the relevant linguistic features, and
encoding them in a form usable in a learning algorithm. Fortunately, a
simple feature vector consisting of numeric or nominal values can easily
encode the most frequently used linguistic information, and is appropriate
for use in most learning algorithms.

The  linguistic  features  used  in  training  WSD  systems  can  be  roughly 
divided  into  two  classes:  collocational  features  and  co-occurrence  features. 

In general, the term collocation refers to a quantifiable position-specific
relationship between two lexical items. Collocational features encode
information about the lexical inhabitants of specific positions located to
the left and right of the target word. Typical items in this category
include the word, the root form of the word, and the word’s part-of-speech.
This type of feature is effective at encoding local lexical and grammatical
information that can often accurately isolate a given sense.




Chapter  17.  Word  Sense  Disambiguation  and  Information  Retrieval 




As  an  example  of  this  type  of  feature-encoding,  consider  the  situation 
where  we  need  to  disambiguate  the  lexeme  bass  in  the  following  example. 

(17.11) An electric guitar and bass player stand off to one side, not really
part of the scene, just as a sort of nod to gringo expectations perhaps.

A feature-vector  consisting  of  the  two  words  to  the  right  and  left  of  the  target 
word,  along  with  their  respective  parts-of-speech,  would  yield  the  following 
vector. 

[guitar,  NN1,  and,  CJC,  player,  NN1,  stand,  VVB] 
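
As a rough sketch of how such a collocational vector might be assembled from
a tagged sentence, consider the following Python fragment. The function
name, the padding symbol, and the tags given to electric and to bass itself
are assumptions made for the illustration; only the four context words and
their tags come from the vector above.

```python
def collocational_features(tagged, target_index, width=2):
    """Flatten (word, tag) pairs from `width` positions on each side of the target."""
    features = []
    offsets = list(range(-width, 0)) + list(range(1, width + 1))
    for offset in offsets:
        position = target_index + offset
        if 0 <= position < len(tagged):
            word, tag = tagged[position]
        else:
            # Hypothetical padding symbol for positions outside the sentence.
            word, tag = "<pad>", "<pad>"
        features.extend([word, tag])
    return features


# A fragment of Example 17.11; the tags for "electric" and "bass" are assumed.
tagged = [("electric", "AJ0"), ("guitar", "NN1"), ("and", "CJC"),
          ("bass", "NN1"), ("player", "NN1"), ("stand", "VVB")]
vector = collocational_features(tagged, target_index=3)
```

Because each position contributes its own pair of slots, the same word
appearing one position further away yields a different feature, which is
what makes this encoding position-specific.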

The second type of feature consists of co-occurrence data about
neighboring words, ignoring their exact position. In this approach, the
words themselves (or their roots) serve as features. The value of the
feature is the number of times the word occurs in a region surrounding the
target word. This region is most often defined as a fixed size window with
the target word at the center. To make this approach manageable, a small
number of frequently used content words are selected for use as features.
This kind of feature is effective at capturing the general topic of the
discourse in which the target word has occurred. This, in turn, tends to
identify senses of a word that are specific to certain domains.

For example, a co-occurrence vector consisting of the 12 most frequent
content words from a collection of bass sentences drawn from the WSJ
corpus would have the words as features: fishing, big, sound, player, fly,
rod, pound, double, runs, playing, guitar, band. Using these words as
features with a window size of 10, Example 17.11 would be represented by the
following vector.

[0,0,0,1,0,0,0,0,0,0,1,0] 
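
The vector above can be reproduced with a few lines of Python. The sketch
assumes that a window size of 10 means five tokens on either side of the
target, and that the sentence has been lowercased and stripped of
punctuation; both are simplifying assumptions rather than details given in
the text.

```python
VOCAB = ["fishing", "big", "sound", "player", "fly", "rod",
         "pound", "double", "runs", "playing", "guitar", "band"]


def cooccurrence_vector(tokens, target_index, vocab=VOCAB, window=10):
    """Count each vocabulary word within window // 2 tokens of the target."""
    half = window // 2
    start = max(0, target_index - half)
    left = tokens[start:target_index]
    right = tokens[target_index + 1:target_index + 1 + half]
    context = left + right
    return [context.count(word) for word in vocab]


tokens = ("an electric guitar and bass player stand off to one side "
          "not really part of the scene").split()
vector = cooccurrence_vector(tokens, tokens.index("bass"))
```

Only player and guitar fall inside the window, so only the fourth and
eleventh positions of the vector are nonzero.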

As  we  will  see,  most  robust  approaches  to  sense  disambiguation  make 
use  of  a combination  of  both  collocational  and  co-occurrence  features. 

Supervised  Learning  Approaches 

In  supervised  approaches,  a sense  disambiguation  system  is  learned  from  a 
representative  set  of  labeled  instances  drawn  from  the  same  distribution  as 
the  test  set  to  be  used.  This  is  a straightforward  application  of  the  supervised 
learning  approach  to  creating  a classifier.  In  such  approaches,  a learning 
system  is  presented  with  a training  set  consisting  of  feature-encoded  inputs 
along  with  their  appropriate  label,  or  category.  The  output  of  the  system  is  a 
classifier  system  capable  of  assigning  labels  to  new  feature-encoded  inputs. 


Methodology  Box:  Evaluating  WSD  Systems 


The basic metric used in evaluating sense disambiguation systems
is simple precision: the percentage of words that are tagged
correctly. The primary baseline against which this metric is
compared is the most frequent sense metric: how well would a system
do if it simply chose the most frequent sense of a word.
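
The metric and the baseline just described amount to very little code. The
following Python sketch uses invented sense labels purely for illustration.

```python
from collections import Counter


def precision(predicted, gold):
    """Fraction of test instances whose predicted sense matches the gold sense."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def most_frequent_sense(training_senses):
    """Return the sense that occurs most often among the training labels."""
    return Counter(training_senses).most_common(1)[0][0]


# Hypothetical sense labels for training and test instances of one word.
train = ["fish", "music", "music", "music"]
gold = ["music", "fish", "music", "music"]
baseline = [most_frequent_sense(train)] * len(gold)
baseline_precision = precision(baseline, gold)
```

A system is only interesting to the extent that its precision exceeds the
figure obtained this way.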

The use of precision requires access to the correct answers to the
words in a test set. Fortunately, two large sense-tagged corpora are
now available: the SEMCOR corpus (Landes et al., 1998), which
consists of a portion of the Brown corpus tagged with WordNet senses,
and the SENSEVAL corpus (Kilgarriff and Rosenzweig, 2000), which
is a tagged corpus derived from the HECTOR corpus and dictionary
project.

A number of issues must be taken into account in comparing
results across systems. The main issue concerns the nature of the
senses used in the evaluation. Two approaches have been followed
over the years: coarse distinctions among homographs, such as the
musical and fish senses of bass, and fine-grained sense distinctions
such as those found in traditional dictionaries. Unfortunately, there
is no standard way of comparing results across these two kinds of
efforts, or across efforts using different dictionaries.

Dictionary senses provide the opportunity for a more fine-grained
scoring metric than simple precision. For example, confusing
a particular musical sense of bass with a fish sense is clearly
worse than confusing it with another musical sense. This
observation gives rise to a notion of partial credit in evaluating these
systems. With such a metric, an exact sense-match would receive full
credit, while selecting a broader sense would receive partial credit.
Of course, this kind of scheme is entirely dependent on the
organization of senses in the particular dictionary being used.

Standardized evaluation frameworks for word sense
disambiguation systems are now available. In particular, the SENSEVAL
effort (Kilgarriff and Palmer, 2000) provides the same kind of
evaluation framework for sense disambiguation that the MUC
(Sundheim, 1995b) and TREC (Voorhees and Harman, 1998) evaluations
have provided for information extraction and information retrieval.
Bayesian classifiers (Duda and Hart, 1973), decision lists (Rivest, 1987),
decision trees (Quinlan, 1986), neural networks (Rumelhart et al., 1986),
logic learning systems (Mooney, 1995), and nearest neighbor methods (Cover
and Hart, 1967) all fit into this paradigm. We will restrict our discussion
to the naive Bayes and decision list approaches, since they have been the
focus of considerable work in word sense disambiguation.
The naive Bayes classifier approach to WSD is based on the premise
that choosing the best sense for an input vector amounts to choosing the most
probable sense given that vector. In other words:


$$\hat{s} = \operatorname*{argmax}_{s \in S} P(s \mid V) \tag{17.12}$$


In this formula, S denotes the set of senses appropriate for the target
associated with this vector. As is almost always the case, it would be
difficult to collect statistics for this equation directly. Instead, we
rewrite it in the usual Bayesian manner as follows:


$$\hat{s} = \operatorname*{argmax}_{s \in S} \frac{P(V \mid s)\,P(s)}{P(V)} \tag{17.13}$$


Of course, the data available that associates specific vectors with senses
is too sparse to be useful. What is provided in abundance in the training set
is information about individual feature-value pairs in the context of specific
senses. Therefore, we can make the same independence assumption that
has served us well in part-of-speech tagging, speech recognition, and
probabilistic parsing: assume that the features are independent of one
another. Making this assumption yields the following equation.


$$P(V \mid s) \approx \prod_{j=1}^{n} P(v_j \mid s) \tag{17.14}$$


Given this equation, training a naive Bayes classifier amounts to
collecting counts of the individual feature-value statistics with respect to
each sense of the target word. The term P(s) is the prior for each sense,
which just corresponds to the proportion of each sense in the training set.
Finally, since P(V) is the same for all possible senses it does not affect
the final ranking of senses, leaving us with the following.


$$\hat{s} = \operatorname*{argmax}_{s \in S} P(s) \prod_{j=1}^{n} P(v_j \mid s) \tag{17.15}$$


Of  course,  all  the  issues  discussed  in  Chapter  8 with  respect  to  zero  counts 
and  smoothing  apply  here  as  well. 
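
A minimal Python sketch of training and applying the classifier of Equation
17.15 follows. Two choices here are assumptions of the sketch rather than
part of the text: add-one smoothing is used as one simple answer to the
zero-count problem, and scores are summed in log space, which leaves the
argmax unchanged while avoiding numerical underflow. The function names and
the toy training data in the usage note are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (feature_vector, sense). Collect the counts needed for scoring."""
    sense_counts = Counter(sense for _, sense in examples)
    # feature_counts[sense][j][value]: times feature j took `value` with `sense`
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values observed for feature j
    for vector, sense in examples:
        for j, value in enumerate(vector):
            feature_counts[sense][j][value] += 1
            values[j].add(value)
    return sense_counts, feature_counts, values


def classify(vector, model):
    """Return argmax over senses of log P(s) + sum_j log P(v_j | s)."""
    sense_counts, feature_counts, values = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior P(s)
        for j, value in enumerate(vector):
            # Add-one smoothing; the extra +1 reserves mass for unseen values.
            numerator = feature_counts[sense][j][value] + 1
            denominator = count + len(values[j]) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

For instance, after training on the four toy instances (["guitar",
"player"], "music"), (["play", "guitar"], "music"), (["striped",
"fishing"], "fish"), and (["sea", "fishing"], "fish"), the vector ["sea",
"fishing"] is assigned the fish sense.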
Rule                         Sense

fish within window      =>   bass (fish)
striped bass            =>   bass (fish)
guitar within window    =>   bass (music)
bass player             =>   bass (music)
piano within window     =>   bass (music)
tenor within window     =>   bass (music)
sea bass                =>   bass (fish)
play/V bass             =>   bass (music)
river within window     =>   bass (fish)
violin within window    =>   bass (music)
salmon within window    =>   bass (fish)
on bass                 =>   bass (music)
bass are                =>   bass (fish)

Figure  17.2  An  abbreviated  decision  list  for  disambiguating  the  fish  sense 
of  bass  from  the  music  sense.  (Adapted  from  (Yarowsky,  1996)) 


In a large experiment evaluating a number of supervised learning
algorithms, Mooney (1996) reports that a naive Bayes classifier and a neural
network achieved the highest performance, both achieving around 73%
correct in assigning one of 6 senses to a corpus of examples of the word line.

Decision list classifiers can be viewed as a simplified variant of
decision trees. In a decision list classifier, a sequence of tests is applied
to each vector-encoded input. If a test succeeds, then the sense associated
with that test is applied to the input and returned. If the test fails, then
the next test in the sequence is applied. This continues until the end of the
list, where a default test simply returns the majority sense. Figure 17.2
shows a portion of a decision list for the task of discriminating the fish
sense of bass from the music sense.

Learning a decision list classifier consists of creating a good sequence
of tests based on the characteristics of the training data. There are a wide
number of methods that can be used to create such lists. Yarowsky (1994)
employs an extremely simple technique that yields excellent results in this
domain. In this approach, all possible feature-value pairs are used to create
tests. These individual tests are then ordered according to their individual
accuracy on the training set, where the accuracy of a test is based on its
log-likelihood ratio:

$$\left| \log \frac{P(\mathit{Sense}_1 \mid f_i = v_j)}{P(\mathit{Sense}_2 \mid f_i = v_j)} \right| \tag{17.16}$$

The decision list is created from these tests by simply ordering the tests
in the list according to this measure, with each test returning the
appropriate sense. Yarowsky (1996) reports that this technique consistently
achieves over 95% correct on a wide variety of binary decision tasks.
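
The ordering procedure just described can be sketched in Python for a binary
sense distinction. Since both conditional probabilities are conditioned on
the same test, their ratio reduces to a ratio of counts. The smoothing
constant alpha, which keeps the ratio finite when a test has been seen with
only one sense, is an assumption of this sketch, as are the function names
and the representation of each instance as the set of tests that fire on it.

```python
import math
from collections import Counter, defaultdict


def learn_decision_list(examples, senses=("fish", "music"), alpha=0.1):
    """examples: list of (set_of_tests_that_fire, sense) for two senses."""
    s1, s2 = senses
    counts = defaultdict(Counter)  # counts[test][sense]
    for tests, sense in examples:
        for test in tests:
            counts[test][sense] += 1
    rules = []
    for test, c in counts.items():
        # Smoothed log-likelihood ratio; alpha is an assumed constant.
        ratio = math.log((c[s1] + alpha) / (c[s2] + alpha))
        rules.append((abs(ratio), test, s1 if ratio > 0 else s2))
    rules.sort(key=lambda rule: rule[0], reverse=True)  # strongest test first
    default = Counter(sense for _, sense in examples).most_common(1)[0][0]
    return rules, default


def apply_decision_list(tests, rules, default):
    """Return the sense of the first rule whose test fires, else the majority sense."""
    for _, test, sense in rules:
        if test in tests:
            return sense
    return default
```

When several tests fire on an input, only the highest-ranked one decides the
outcome; the evidence from the others is ignored, which is the main way this
classifier differs from naive Bayes.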

We should note that this training method differs quite a bit from the
standard decision list learning algorithm. For the details and theoretical
motivation for that approach see (Rivest, 1987; Russell and Norvig, 1995).


Bootstrapping  Approaches 

Not surprisingly, a major problem with supervised approaches is the need
for a large sense-tagged training set. The bootstrapping approach (Hearst,
1991; Yarowsky, 1995) eliminates the need for a large training set by relying
on a relatively small number of instances of each sense for each lexeme of
interest. These labeled instances are used as seeds to train an initial
classifier using any of the supervised learning methods mentioned in the last
section. This initial classifier is then used to extract a larger training set
from the remaining untagged corpus. Repeating this process results in a
series of classifiers with improving accuracy and coverage.

The key to this approach lies in its ability to create a larger training set
from a small set of seeds. To succeed, it must include only those instances
in which the initial classifier has a high degree of confidence. This larger
training set is then used to create a new more accurate classifier with
broader coverage. With each iteration of this process, the training corpus
grows and the untagged corpus shrinks. As with most iterative methods, this
process can be repeated until some sufficiently low error rate on the
training set is reached, or until no further examples from the untagged
corpus are above threshold.
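
The iteration described above can be sketched as a short Python loop. The
learner used here, a simple table of word-sense counts with a vote-share
confidence, is a deliberately crude stand-in chosen to keep the example
self-contained; any of the supervised methods discussed earlier could take
its place. The function names, the threshold of 0.8, and the iteration cap
are assumptions of the sketch.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Toy learner: count how often each context word occurs with each sense."""
    counts = defaultdict(Counter)
    for words, sense in labeled:
        for word in words:
            counts[word][sense] += 1
    return counts


def predict(counts, words):
    """Return (sense, confidence); confidence is the winning share of the votes."""
    votes = Counter()
    for word in words:
        votes.update(counts.get(word, {}))
    if not votes:
        return None, 0.0
    sense, top = votes.most_common(1)[0]
    return sense, top / sum(votes.values())


def bootstrap(seeds, untagged, threshold=0.8, max_iterations=10):
    """Grow the labeled set from the seeds, adding only high-confidence instances."""
    labeled, untagged = list(seeds), list(untagged)
    for _ in range(max_iterations):
        model = train(labeled)
        confident, remaining = [], []
        for words in untagged:
            sense, confidence = predict(model, words)
            if confidence >= threshold:
                confident.append((words, sense))
            else:
                remaining.append(words)
        if not confident:  # no untagged example is above threshold: stop
            break
        labeled.extend(confident)
        untagged = remaining
    return train(labeled), labeled, untagged
```

Each pass moves the confidently classified instances from the untagged pool
into the training set, so an instance that shares no words with the seeds
can still be reached in a later pass through words learned along the way.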

The initial seed set used in these bootstrapping methods can be
generated in a number of ways. Hearst (1991) generates a seed set by
hand-labeling a small set of examples from the initial corpus. This approach
has three major advantages:

• There is a reasonable certainty that the seed instances are correct, thus
ensuring that the learner does not get off on the wrong foot.

• The analyst can make some effort to choose examples that are not only
correct, but in some sense prototypical of each sense.

• It is reasonably easy to carry out.

Klucevsek plays Giulietti or Titano piano accordions with the more flexible, more
difficult free bass rather than the traditional Stradella bass with its preset chords
designed mainly for accompaniment.

We need more good teachers - right now, there are only a half a dozen who can
play the free bass with ease.

An electric guitar and bass player stand off to one side, not really part of the
scene, just as a sort of nod to gringo expectations perhaps.

When the New Jersey Jazz Society, in a fund-raiser for the American Jazz Hall of
Fame, honors this historic night next Saturday, Harry Goodman, Mr. Goodman’s
brother and bass player at the original concert, will be in the audience with other
family members.

The researchers said the worms spend part of their life cycle in such fish as Pacific
salmon and striped bass and Pacific rockfish or snapper.

Associates describe Mr. Whitacre as a quiet, disciplined and assertive manager
whose favorite form of escape is bass fishing.

And it all started when fishermen decided the striped bass in Lake Mead were too
skinny.

Though still a far cry from the lake’s record 52-pound bass of a decade ago, “you
could fillet these fish again, and that made people very, very happy,” Mr. Paulson
says.

Saturday morning I arise at 8:30 and click on “America’s best-known fisherman,”
giving advice on catching bass in cold weather from the seat of a bass boat in
Louisiana.

Figure 17.3 Samples of bass sentences extracted from the WSJ using the
simple correlates play and fish.

A remarkably effective alternative technique is to simply search for
sentences containing single words that are strongly correlated with the
target senses. Yarowsky (1995) calls this the One Sense per Collocation
constraint and presents results showing that it works very well in practice.
For example, Figure 17.3 shows a partial result of such a search for the
strings “fish” and “play” in a corpus of bass examples drawn from the WSJ.

Yarowsky (1995) suggests two methods to select effective correlates:
deriving them from machine-readable dictionary entries, and selecting seeds
using collocational statistics such as those described in Chapter 6. Putting
all of this to the test, Yarowsky (1995) reports an average performance of
96.5% on a coarse binary sense assignment of 12 words.
Unsupervised Methods: Discovering Word Senses

Unsupervised approaches to sense disambiguation eschew the use of
sense-tagged data of any kind during training. In these approaches,
feature-vector representations of unlabeled instances are taken as input and
are then grouped into clusters according to a similarity metric. These
clusters can then be represented as the average of their constituent
feature-vectors, and labeled by hand with known word senses. Unseen
feature-encoded instances can be classified by assigning them the word sense
from the cluster to which they are closest according to the similarity metric.

Fortunately, clustering is a well-studied problem with a wide number
of standard algorithms that can be applied to inputs structured as vectors of
numerical values (Duda and Hart, 1973). The most frequently used
technique in language applications is known as agglomerative clustering. In
this technique, each of the N training instances is initially assigned to its
own cluster. New clusters are then formed in a bottom-up fashion by
successively merging the two clusters that are most similar. This process
continues until either a specified number of clusters is reached, or some
global goodness measure among the clusters is achieved. In cases where the
number of training instances makes this method too expensive, random sampling
can be used on the original training set (Cutting et al., 1992b) to achieve
similar results.
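
A bare-bones Python sketch of the bottom-up merging procedure follows. It
assumes cosine similarity between cluster centroids as the similarity
metric and a fixed target number of clusters as the stopping condition;
other metrics and stopping criteria are equally possible, and nothing here
is specific to any published system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(cluster):
    """Average of the feature vectors in a cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]


def agglomerate(vectors, num_clusters):
    """Repeatedly merge the two most similar clusters until num_clusters remain."""
    clusters = [[v] for v in vectors]  # each instance starts in its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Every merge step compares all pairs of remaining clusters, which is why the
method becomes expensive as the number of training instances grows and why
sampling is attractive for large collections.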

Of course, the fact that these unsupervised methods do not make use
of hand-labeled data poses a number of challenges for evaluating the
goodness of any clustering result. The following problems are among the most
important ones that have to be addressed in unsupervised approaches.

• The  correct  senses  of  the  instances  used  in  the  training  data  may  not  be 
known. 

• The clusters are almost certainly heterogeneous with respect to the
senses of the training instances contained within them.

• The  number  of  clusters  is  almost  always  different  from  the  number  of 
senses  of  the  target  word  being  disambiguated. 

Schütze’s experiments (Schütze, 1992, 1998) constitute the most
extensive application of unsupervised clustering to word sense disambiguation
to date. Although the actual technique is quite involved, unsupervised
agglomerative clustering is at the core of the method. As with the supervised
approaches, the bulk of this work is directed at coarse binary distinctions.
In this work, the first two problems are addressed through the use of
pseudowords and a hand-labeling of a small subset of the instances in each
cluster. The heterogeneity issue is addressed by assigning the majority sense
to each of the induced clusters. Given this approach, the last problem is not
an issue; the various discovered clusters are simply labeled with their
majority sense. The fact that there may be multiple clusters with the same
sense is not directly an issue in disambiguation.

Schütze’s results indicate that for coarse binary distinctions,
unsupervised techniques can achieve results approaching those of supervised
and bootstrap methods, in most instances approaching the 90% range. As with
most of the supervised methods, this method was tested on a small sample
of words (10 pseudowords, and 10 real words).

Dictionary-Based  Approaches 

A major drawback with all of the approaches described above is the problem
of scale. All require a considerable amount of work to create a classifier for
each ambiguous entry in the lexicon. For this reason, most of the
experiments with these methods report results ranging from 2 to 12 lexical
items (the work of Ng and Lee (1996) is a notable exception, reporting results
disambiguating 121 nouns and 70 verbs). Scaling up any of these approaches to
deal with all the ambiguous words in a language would be a large
undertaking. Instead, attempts to perform large-scale disambiguation have
focused on the use of machine-readable dictionaries, of the kind discussed in
Chapter 16. In this style of approach, the dictionary provides both the means
for constructing a sense tagger, and the target senses to be used.

The first implementation of this approach is due to Lesk (1986). In
this approach, all the sense definitions of the word to be disambiguated are
retrieved from the dictionary. These senses are then compared to the
dictionary definitions of all the remaining words in the context. The sense
with the highest overlap with these context words is chosen as the correct
sense. Note that the various sense definitions of the context words are
simply lumped together in this approach. Lesk reports accuracies of 50-70% on
short samples of text selected from Austen’s Pride and Prejudice and an AP
newswire article.
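
The overlap computation can be illustrated with a small Python sketch. The
miniature dictionary, its definitions, the sense labels, and the stopword
list are all invented for the example, and the sketch counts overlap over
distinct content words only; it is a simplification in the spirit of the
approach, not a reproduction of Lesk's program.

```python
# Hypothetical miniature dictionary: word -> {sense label: definition}.
DICTIONARY = {
    "bank": {
        "bank/finance": "an institution that accepts money deposits and makes loans",
        "bank/river": "sloping land beside a body of water such as a river",
    },
    "loan": {"loan/1": "money lent by an institution at interest"},
    "water": {"water/1": "clear liquid found in a river lake or sea"},
}
STOPWORDS = {"a", "an", "and", "as", "at", "by", "in", "of", "or",
             "such", "that", "the"}


def content_words(text):
    """Lowercased words of a definition, minus stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def lesk(target, context, dictionary=DICTIONARY):
    """Pick the sense of target whose definition overlaps most with the
    pooled definitions of the context words."""
    pooled = set()
    for word in context:
        for definition in dictionary.get(word, {}).values():
            pooled |= content_words(definition)  # all senses lumped together
    best_sense, best_overlap = None, -1
    for sense, definition in dictionary[target].items():
        overlap = len(content_words(definition) & pooled)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

With this toy dictionary, bank in the context of loan resolves to the
financial sense through the shared words institution and money, while bank
in the context of water resolves to the river sense through the single
shared word river.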

The problem with this approach is that dictionary entries for the
various senses of target words are relatively short, and may not provide
sufficient material to create adequate classifiers.¹ More specifically, the
words used in the context and their definitions must have direct overlap with
the words contained in the appropriate sense definition in order to be useful.
One way to remedy this problem is to expand the list of words used in the
classifier to include words related to, but not contained in, their
individual sense definitions. This can be accomplished by including words
whose definitions make use of the target word. For example, the word deposit
does not occur in the definition of bank in the American Heritage Dictionary
(Morris, 1985). However, bank does occur in the definition of deposit.
Therefore, the classifier for bank can be expanded to include deposit as a
relevant feature.

¹ Indeed, Lesk (1986) notes that the performance of his system seems to
roughly correlate with the length of the dictionary entries.

Of course, just knowing that deposit is related to bank does not help
much since we don’t know to which of bank’s senses it is related.
Specifically, to make use of deposit as a feature we have to know which sense
of bank was being used in its definition. Fortunately, many dictionaries and
thesauri include tags known as subject codes in their entries that correspond
roughly to broad conceptual categories. For example, the entry for bank
in the Longman’s Dictionary of Contemporary English (LDOCE) (Procter,
1978) includes the subject code EC (Economics) for the financial senses of
bank. Given such subject codes, we can guess that expanded terms with
the subject code EC will be related to this sense of bank rather than any of
the others. Guthrie et al. (1991) report results ranging from 47% correct for
fine-grained LDOCE distinctions to 72% for more coarse distinctions.

Note that none of these techniques actually exploit the dictionary
entries as definitions. Rather, they can be viewed as variants of the
supervised learning approach, where the content of the dictionary is used to
provide the tagged training materials.


17.3  Information  Retrieval 

The field of information retrieval is of interest to us here due to its
widespread adoption of word-based indexing and retrieval methods. Most current
information retrieval systems are based on an extreme interpretation of the
principle of compositional semantics. In these systems, the meaning of
documents resides solely in the words that are contained within them. To
revisit the Mad Hatter’s quote from the beginning of Chapter 16, in these
systems I see what I eat and I eat what I see mean precisely the same thing.
The ordering and constituency of the words that make up the sentences that
make up documents play no role in determining their meaning. Because they
ignore syntactic information, these approaches are often referred to as bag
of words methods.
Before moving on, we need to introduce some new terminology. In
information retrieval, a document refers generically to the unit of text
indexed in the system and available for retrieval. Depending on the
application, a document can refer to anything from intuitive notions like
newspaper articles, or encyclopedia entries, to smaller units such as
paragraphs and sentences. In Web-based applications, it can refer to a Web
page, a part of a page, or to an entire Web-site. A collection refers to a set
of documents being used to satisfy user requests. A term refers to a lexical
item that occurs in a collection, but it may also include phrases. Finally, a
query represents a user’s information need expressed as a set of terms.

The  specific  information  retrieval  task  that  we  will  consider  in  detail  is 
known  as  ad  hoc  retrieval.  In  this  task,  it  is  assumed  that  an  unaided  user 
poses  a query  to  a retrieval  system,  which  then  returns  a possibly  ordered 
set  of  potentially  useful  documents.  Several  other  related,  lexically  oriented, 
information  retrieval  tasks  will  be  discussed  in  Section  17.4. 

The  Vector  Space  Model 

In the vector space model of information retrieval, documents and queries
are represented as vectors of features representing the terms that occur
within them (Salton, 1971). More properly, they are represented as vectors
of features consisting of the terms that occur within the collection, with
the value of each feature indicating the presence or absence of a given term
in a given document. These vectors can be denoted as follows:

\[ \vec{d}_j = (t_{1,j}, t_{2,j}, t_{3,j}, \ldots, t_{N,j}) \]

\[ \vec{q}_k = (t_{1,k}, t_{2,k}, t_{3,k}, \ldots, t_{N,k}) \]

In  this  notation,  the  various  t features  represent  the  N terms  that  occur  in  the 
collection.  Let’s  first  consider  the  case  where  these  features  take  on  the  value 
of  one  or  zero,  indicating  the  presence  or  absence  of  a term  in  a document 
or  query.  Given  this  approach,  a simple  way  to  compare  a document  to  a 
query,  or  another  document,  is  to  sum  up  the  number  of  terms  they  have  in 
common,  as  in  the  following  equation. 
\[ s(\vec{q}_k, \vec{d}_j) = \sum_{i=1}^{N} t_{i,k} \times t_{i,j} \qquad (17.17) \]
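
The term-overlap score of Equation 17.17 can be sketched in a few lines of Python. This is only an illustration: the three-term vocabulary, the sample texts, and the function names binary_vector and overlap are assumptions made for the example, and whitespace splitting stands in for real tokenization.

```python
def binary_vector(text, vocabulary):
    """Return a 0/1 vector marking which vocabulary terms occur in text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


def overlap(q, d):
    """Equation 17.17: the number of terms two binary vectors share."""
    return sum(t_q * t_d for t_q, t_d in zip(q, d))


# Hypothetical three-term vocabulary and texts, chosen only for illustration.
vocabulary = ["speech", "language", "processing"]
doc = binary_vector("speech and language processing", vocabulary)
query = binary_vector("speech processing", vocabulary)
score = overlap(query, doc)
```

Because every feature is zero or one, each product contributes one only when a term occurs in both vectors, so the sum is simply a count of shared terms.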

Of course, a problem with the use of binary values for features is that it
fails to capture the fact that some terms are more important to the meaning
of a document than others. A useful generalization is to replace the ones
and zeroes with numerical weights that indicate the importance of the
various terms in particular documents and queries. We can thus generalize
our vectors as follows:

\[ \vec{d}_j = (w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{N,j}) \]

\[ \vec{q}_k = (w_{1,k}, w_{2,k}, w_{3,k}, \ldots, w_{N,k}) \]

This characterization of individual documents as vectors of term weights
allows us to view the document collection as a whole as a matrix of weights,
where w_{i,j} represents the weight of term i in document j. This weight
matrix is typically called a term-by-document matrix. Under this view, the
columns of the matrix represent the documents in the collection, and the
rows represent the terms.

A useful  view  of  this  model  conceives  of  the  features  used  to  represent 
documents  (and  queries)  as  dimensions  in  a multi-dimensional  space.  Corre- 
spondingly, the  weights  that  serve  as  values  for  those  features  serve  to  locate 
documents  in  that  space.  When  a user’s  query  is  translated  into  a vector  it 
denotes  a point  in  that  space.  Documents  that  are  located  close  to  the  query 
can  then  be  judged  as  being  more  relevant  than  documents  that  are  farther 
away. 

This characterization of documents and queries as vectors provides all
the basic parts for an ad hoc retrieval system. A document retrieval system
can simply accept a user’s query, create a vector representation for it,
compare it against the vectors representing all known documents, and sort
the results. The result is a list of documents rank ordered by their
similarity to the query.

Consider, as an example of this approach, the space shown in Figure 17.4.
This figure shows a simplified space consisting of the three dimensions
corresponding to the terms speech, language, and processing. The three
vectors illustrated in this space represent documents derived from the
chapter and section headings of Chapters 1, 7, and 13 of this text, which
we'll denote as Doc1, Doc7, and Doc13, respectively. If we identify term
weights with raw term frequency, then Doc1 is represented by the vector
(1,2,1), Doc7 by (6,0,1), and Doc13 by (0,5,1). As is clear from the figure,
this space captures certain intuitions about how these chapters are related.
Chapter 1, being general, is fairly similar to both Chapters 7 and 13.
Chapters 7 and 13, on the other hand, are distant from one another since
they cover a different set of topics.

Figure 17.4 A simple vector space representation of documents derived
from the text of the chapter and section headings of Chapters 1, 7, and 13
in three dimensions.

Unfortunately, this particular instantiation of a vector space places too
much emphasis on the absolute values of the various coordinates of each
document. For example, what is important about the speech dimension of
Doc7 is not the value 6 but rather that it is the dominant contributor to
the meaning of that document. Similarly, the specific values of 1, 2, and 1
for Doc1 are not important; what is important is that the three dimensions
have roughly similar weights. It would be sensible, for example, to assume
that a new document with weights 3, 6, and 3 would be quite similar to Doc1
despite the magnitude differences in the term weights.

We can accomplish this effect by normalizing the document vectors. By
normalizing, we simply mean converting all the vectors to a standard length.
Converting to a unit length can be accomplished by dividing each of a
vector’s dimensions by the overall length of the vector, which is defined as
\sqrt{\sum_{i=1}^{N} w_i^2}. This, in effect, eliminates the importance of
the exact length of a document’s vector in the space, and emphasizes instead
the direction of the document vector with respect to the origin.

Applying this technique to our three sample documents results in the
following term-by-document matrix, A, where the columns represent Doc1,
Doc7, and Doc13 and the rows represent the terms speech, language, and
processing. The entries are shown rounded to two decimal places.

\[ A = \begin{pmatrix} .41 & .99 & 0 \\ .82 & 0 & .98 \\ .41 & .16 & .20 \end{pmatrix} \]

You should verify that with this scheme, the normalized vectors for Doc1
and our hypothetical (3,6,3) document end up as identical vectors.
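
The normalization step, and the suggested check on the (3,6,3) document, can be sketched as follows. The function name normalize and the handling of an all-zero vector are assumptions of this sketch; the raw frequencies are the ones given above.

```python
import math


def normalize(vector):
    """Scale a vector to unit length by dividing by its Euclidean length."""
    length = math.sqrt(sum(w * w for w in vector))
    if length == 0:
        return list(vector)  # an all-zero vector has no direction to keep
    return [w / length for w in vector]


# Raw term frequencies for (speech, language, processing), as given in the text.
doc1, doc7, doc13 = (1, 2, 1), (6, 0, 1), (0, 5, 1)
hypothetical = (3, 6, 3)

unit_doc1 = normalize(doc1)
unit_hypothetical = normalize(hypothetical)

# True when the two unit vectors agree up to floating-point rounding.
same_direction = all(
    math.isclose(a, b) for a, b in zip(unit_doc1, unit_hypothetical)
)
```

The comparison uses math.isclose rather than exact equality, since the two vectors are computed along different arithmetic paths and may differ in the last floating-point digit.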

Let’s return now to the topic of determining the similarity between
vectors. Updating the similarity metric given earlier with numerical weights
rather than binary values gives us the following equation.

\[ s(\vec{q}_k, \vec{d}_j) = \vec{q}_k \cdot \vec{d}_j = \sum_{i=1}^{N} w_{i,k} \times w_{i,j} \qquad (17.18) \]

This equation specifies what is known as the dot product between vectors.
Now, in general, the dot product between two vectors is not particularly
useful as a similarity metric, since it is too sensitive to the absolute
magnitudes of the various dimensions. However, the dot product between
vectors that have been normalized has a useful and intuitive interpretation:
it computes the cosine of the angle between two vectors. When two documents
are identical they will receive a cosine of one; when they are orthogonal
(share no common terms) they will receive a cosine of zero.

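Note that if for some reason the vectors are not stored in a normalized
form, then the normalization can be incorporated directly into the
similarity measure as follows.

\[ s(\vec{q}_k, \vec{d}_j) = \frac{\sum_{i=1}^{N} w_{i,k} \times w_{i,j}}{\sqrt{\sum_{i=1}^{N} w_{i,k}^2} \times \sqrt{\sum_{i=1}^{N} w_{i,j}^2}} \qquad (17.19) \]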

Of  course,  in  situations  where  the  document  collection  is  relatively  static  and 
many  queries  are  being  performed,  it  makes  sense  to  normalize  the  document 
vectors  once  and  store  them,  rather  than  include  the  normalization  in  the 
similarity  metric. 

Let’s  consider  how  this  similarity  metric  would  work  in  the  context 
of  some  small  examples.  Consider  the  carefully  selected  query  consisting 
solely  of  the  terms  speech,  language  and  processing.  Converting  this  query 
to a vector and normalizing it results in the vector (.57, .57, .57). Computing
the cosines between this vector and our three document vectors shows that
Doc1 is closest with a cosine of .92, followed by Doc13 with a cosine of
.67, and finally Doc7 with a cosine of .65. Not surprisingly, this ranking is
in  close  accord  with  our  intuitions  about  the  relationship  between  this  query 
and  these  documents. 

Now  consider  a shorter  query  consisting  solely  of  the  terms  speech  and 
processing. Processing this query yields the normalized vector (.70, 0, .70).
When the cosines are computed between this vector and our documents,
Doc7 is now the closest with a cosine of .80, followed by Doc1 with a cosine
of .58, with Doc13 coming in a distant third with a cosine of .13.
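
The two rankings just described can be reproduced with a short sketch of the cosine measure in Equation 17.19, applied to the raw frequency vectors. The function names cosine and rank, and the convention of returning zero for an empty vector, are assumptions of this sketch. Computed without intermediate rounding, the cosine values can differ in the second decimal place from the figures quoted above, but the orderings are the same.

```python
import math


def cosine(q, d):
    """Equation 17.19: dot product divided by the product of vector lengths."""
    dot = sum(wq * wd for wq, wd in zip(q, d))
    norm_q = math.sqrt(sum(w * w for w in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention assumed here for empty vectors
    return dot / (norm_q * norm_d)


def rank(query, documents):
    """Return document names sorted from most to least similar to the query."""
    return sorted(documents, key=lambda name: cosine(query, documents[name]),
                  reverse=True)


# Raw term frequencies for (speech, language, processing), from the text.
documents = {"Doc1": (1, 2, 1), "Doc7": (6, 0, 1), "Doc13": (0, 5, 1)}

ranking_full = rank((1, 1, 1), documents)   # speech, language, processing
ranking_short = rank((1, 0, 1), documents)  # speech, processing
```

Since the normalization is folded into the similarity function, neither the query nor the documents need to be stored in unit-length form.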

Term  Weighting 

In practice, the method used to assign term weights in the document and
query  vectors  has  an  enormous  impact  on  the  effectiveness  of  a retrieval 
system.  Two  factors  have  proven  to  be  critical  in  deriving  effective  term 
weights:  term  frequency  within  a single  document,  and  the  distribution  of 
terms  across  a collection.  We  can  begin  with  the  simple  notion  that  terms  that 
occur  frequently  within  a document  may  reflect  its  meaning  more  strongly 
than  terms  that  occur  less  frequently  and  should  thus  have  higher  weights. 
In  its  simplest  form,  this  factor  is  called  term  frequency  and  is  simply  the 
raw  frequency  of  a term  within  a document  (Luhn,  1957). 

The second factor to consider is the distribution of terms across the
collection as a whole. Terms that are limited to a few documents are useful
for discriminating those documents from the rest of the collection. On the
other hand, terms that occur frequently across the entire collection are
less useful in discriminating among documents. What is needed therefore is a
measure that favors terms that occur in fewer documents. The fraction
N/n_i, where N is the total number of documents in the collection, and n_i
is the number of documents in which term i occurs, provides exactly this
measure. The fewer documents a term occurs in, the higher this weight. The
lowest weight of 1 is assigned to terms that occur in all the documents. Due
to the large number of documents in many collections, this measure is
usually squashed with a log function, leaving us with the following inverse
document frequency term weight (Sparck Jones, 1972).

\[ idf_i = \log\left(\frac{N}{n_i}\right) \qquad (17.20) \]

Combining the term frequency factor with this factor results in a scheme
known as tf·idf weighting.

\[ w_{i,j} = tf_{i,j} \times idf_i \qquad (17.21) \]

That is, the weight of term i in the vector for document j is the product of
its overall frequency in j with the log of its inverse document frequency in
the collection. With some minor variations, this weighting scheme is used to
assign term weights to documents in nearly all vector space retrieval models.

Methodology Box: Evaluating Information Retrieval Systems

Information retrieval systems are evaluated with respect to the notion of
relevance, a judgment by a human that a document is relevant to a query. A
system's ability to retrieve relevant documents is assessed with a recall
measure, as in Chapter 15.

\[ \text{Recall} = \frac{\text{\# of relevant documents returned}}{\text{total \# of relevant documents in the collection}} \]

Of course, a system can achieve 100% recall by simply returning all the
documents in the collection. A system's accuracy is based on how many of the
documents returned for a given query are actually relevant, which can be
assessed by a precision metric.

\[ \text{Precision} = \frac{\text{\# of relevant documents returned}}{\text{\# of documents returned}} \]

These measures are complicated by the fact that most systems do not make
explicit relevance judgments, but rather rank their collection with respect
to a query. To deal with this we can specify a set of cutoffs in the output,
and measure average precision for the documents ranked above the cutoff.
Alternatively, we can specify a set of recall levels and measure average
precision at those levels. This latter method gives rise to what are known
as precision-recall curves as shown in Figure 17.5. As these curves show,
comparing the performance of two systems can be difficult. In this
comparison, one system is better at both high and low levels of recall,
while the other is better in the middle region. An alternative to these
curves is the use of metrics that attempt to combine recall and precision
into a single value. The F measure introduced on page 576 is one such
measure.

The U.S. government sponsored TREC (Text REtrieval Conference) evaluations
have provided a rigorous testbed for the evaluation of a variety of
information retrieval tasks and techniques. Like the MUC evaluations, TREC
provides large document sets for both training and testing, along with a
uniform scoring system. Training materials consist of sets of documents
accompanied by sets of queries (called topics in TREC) and relevance
judgments. Voorhees and Harman (1998) provide the details for the most
recent meeting. Details of all of the meetings can be found at the TREC page
on the National Institute of Standards and Technology Web site.
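
The recall and precision measures defined in the Methodology Box, along with precision measured at a set of rank cutoffs, can be sketched as follows. The document identifiers, the relevance judgments, and the function names are hypothetical, and returning zero when a denominator would be empty is a convention assumed for this sketch.

```python
def recall(returned, relevant):
    """Fraction of all relevant documents that were returned."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(set(returned) & set(relevant)) / len(set(relevant))


def precision(returned, relevant):
    """Fraction of returned documents that are relevant."""
    if not returned:
        return 0.0  # assumed convention when nothing is returned
    return len(set(returned) & set(relevant)) / len(set(returned))


def precision_at_cutoffs(ranking, relevant, cutoffs):
    """Precision of the top-k ranked documents for each cutoff k."""
    return {k: precision(ranking[:k], relevant) for k in cutoffs}


# Hypothetical ranked output and human relevance judgments.
ranking = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d1", "d2", "d9"}

r = recall(ranking, relevant)       # 2 of the 3 relevant documents are found
p = precision(ranking, relevant)    # 2 of the 5 returned documents are relevant
by_cutoff = precision_at_cutoffs(ranking, relevant, [1, 2, 4])
```

Note that returning the whole collection would drive recall to 1.0 in this sketch while precision fell, which is the trade-off the box describes.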

Despite  the  fact  that  we  use  the  same  representations  for  documents 
and  queries,  it  is  not  at  all  clear  that  the  same  weighting  scheme  should  be 
used  for  both.  In  many  ad  hoc  retrieval  settings  such  as  Web  search  engines, 
user queries are not very much like documents at all. For example, an
analysis of a very large set of queries (1,000,000,000 actually) from the
AltaVista search engine reveals that the average query length is around 2.3
words (Silverstein et al., 1998). In such an environment, the raw term
frequency in the query is not likely to be a very useful factor. Instead,
Salton and Buckley (1988) recommend the following formula for weighting
query terms, where \max_j tf_{j,k} denotes the frequency of the most
frequent term in document k.

\[ w_{i,k} = \left(0.5 + \frac{0.5 \, tf_{i,k}}{\max_j tf_{j,k}}\right) \times idf_i \qquad (17.22) \]
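
The document weights of Equation 17.21 and the query weights of Equation 17.22 can be sketched together. Several choices here are assumptions of the sketch rather than part of the formulas: the natural logarithm (the text does not fix a base), an idf of zero for terms absent from the collection, the tiny tokenized collection, and the function names.

```python
import math
from collections import Counter


def idf(term, documents):
    """Equation 17.20: log(N / n_i), where n_i counts documents containing term."""
    n_i = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n_i) if n_i else 0.0


def document_weights(doc, documents):
    """Equation 17.21: raw term frequency times idf for each term in doc."""
    tf = Counter(doc)
    return {term: count * idf(term, documents) for term, count in tf.items()}


def query_weights(query, documents):
    """Equation 17.22: (0.5 + 0.5 * tf / max tf) * idf for each query term."""
    tf = Counter(query)
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {term: (0.5 + 0.5 * count / max_tf) * idf(term, documents)
            for term, count in tf.items()}


# Hypothetical tokenized collection of three tiny documents.
documents = [
    ["speech", "recognition", "speech"],
    ["language", "processing"],
    ["speech", "processing"],
]
w_doc = document_weights(documents[0], documents)
w_query = query_weights(["speech", "recognition"], documents)
```

In this toy collection recognition occurs in only one document and so receives a higher idf than speech, which occurs in two; a term occurring in every document would receive an idf of log(1) = 0.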

Term  Selection  and  Creation 

We  have  been  assuming  thus  far  that  it  is  precisely  the  words  that  occur  in 
a collection  that  will  be  used  to  index  the  documents  in  the  collection.  Two 
common  variations  on  this  assumption  involve  the  use  of  stemming,  and  a 
stop  list. 

The notion of stemming takes us back to Chapter 3 and the topic of
morphological analysis. The basic question addressed by stemming is whether
the morphological variants of a lexical item should be listed (and counted)
separately, or whether they should be collapsed into a single root form. For
example, without stemming, the terms process, processing and processed
will be treated as distinct items with separate term frequencies in a
term-by-document matrix; with stemming they will be conflated to the single
term process with a single summed frequency count. The major advantage to
using stemming is that it allows a particular query term to match documents
containing any of the morphological variants of the term. The Porter stemmer
(Porter, 1980) described in Chapter 3 is the system most often used for this
purpose in retrieval from collections of English documents.
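
The effect of conflation on term frequencies can be illustrated with a small sketch. The lookup table STEMS is a hypothetical stand-in for a real stemmer such as Porter's, which derives stems by rule rather than by table; only the counting behavior is being shown here.

```python
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer such as Porter's.
STEMS = {"processing": "process", "processed": "process"}


def term_frequencies(tokens, stems=None):
    """Count terms, optionally conflating variants to a shared stem."""
    stems = stems or {}
    return Counter(stems.get(t, t) for t in tokens)


tokens = ["process", "processing", "processed", "speech"]
unstemmed = term_frequencies(tokens)         # three separate process entries
stemmed = term_frequencies(tokens, STEMS)    # one summed process entry
```

Without the table the three variants occupy three rows of the term-by-document matrix with a count of one each; with it they share a single row with a count of three.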

A significant problem with this approach is that it throws away useful
distinctions. For example, consider the use of the Porter stemmer on
documents and queries containing the words stocks and stockings. In this
case, the Porter stemmer reduces these surface forms to the single term
stock. Of course, the result of this is that queries concerning stock prices
will return documents about stockings, and queries about stockings will find
documents about stocks.² More technically, stemming may increase recall by
finding documents with terms that are morphologically related to queries,
but it may also reduce precision by returning semantically unrelated
documents. For this reason, few Web search engines currently make use of
stemming. Frakes and Baeza-Yates (1992) present results from a series of
experiments that explore the efficacy of stemming.

² This example is motivated by some bad publicity received by a well-known
search engine, when it returned some rather salacious sites containing
extensive use of the term stockings in response to queries concerning stock
prices. In response, a spokesman announced that their engineers were working
hard on a solution to this strange problem with words.

A second common technique is the use of stop lists, which address the issue
of what words should be allowed into the index. A stop list is a list of
high frequency words that are eliminated from the representation of both
documents and queries. Two motivations are normally given for this strategy:
high frequency, closed-class terms are seen as carrying little semantic
weight and are thus unlikely to help with retrieval, and eliminating them
can save considerable space in the inverted index files used to map from
terms to the documents that contain them. The downside of using a stop list
is that it makes it difficult to search for phrases that contain words in
the stop list. For example, a common stop list derived from the Brown corpus
presented in Frakes and Baeza-Yates (1992) would reduce the phrase to be or
not to be to the phrase not.
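
The behavior of a stop list on this phrase can be sketched as follows. The eight-word STOP_WORDS set is a hypothetical miniature; it is not the Brown-corpus list cited above, which is far longer.

```python
# A tiny hypothetical stop list; real lists are far longer and are derived
# from corpus frequency counts.
STOP_WORDS = {"a", "an", "and", "be", "of", "or", "the", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list, preserving order."""
    return [t for t in tokens if t.lower() not in stop_words]


phrase = "to be or not to be".split()
indexed = remove_stop_words(phrase)
```

Only the token not survives, so a phrase query for the original six words has nothing distinctive left to match against the index.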

Homonymy,  Polysemy  and  Synonymy 

Since the vector space model is based solely on the use of simple terms, it’s
useful to consider the effect that various lexical semantic phenomena have on
the model. Consider a query containing the word canine with its tooth and
dog senses. A query containing canine will be judged similar to documents
making use of either of these senses. However, given that users are probably
only interested in one of these senses, the documents containing the other
sense will be judged non-relevant. Homonymy and polysemy, therefore,
have the effect of reducing precision by leading a system to return documents
irrelevant to the user’s information need.

Now  consider  a query  consisting  of  the  lexeme  dog.  This  query  will 
be  judged  close  to  documents  that  make  frequent  use  of  the  term  dog,  but 
may  fail  to  match  documents  that  use  close  synonyms  like  canine,  as  well  as 
documents  that  use  hyponyms  such  as  malamute.  Synonymy  and  hyponymy, 
therefore,  have  the  effect  of  reducing  recall  by  causing  the  retrieval  system 
to  miss  relevant  documents. 

Note that it is inaccurate to state flatly that polysemy reduces precision
and synonymy reduces recall since, as we discussed on page 648, both
measures are relative to a fixed cutoff. As a result, every non-relevant
document that rises above the cutoff due to polysemy takes up a slot in the
fixed-size return set, and may thus push a relevant document below
threshold, thus reducing recall. Similarly, when a document is missed due to
synonymy, a slot is opened in the return set for a non-relevant document,
potentially reducing precision as well.

Not surprisingly, these issues lead to the question of whether or not
word sense disambiguation can help in information retrieval. The evidence
on this point is mixed, with some experiments reporting a sizable gain using
disambiguation (Schütze and Pedersen, 1995), and others reporting either no
gain, or a degradation in performance (Krovetz and Croft, 1992; Voorhees,
1998).

Improving  User  Queries 

One  of  the  most  effective  ways  to  improve  retrieval  performance  is  to  find  a 
way  to  improve  user  queries.  The  techniques  presented  in  this  section  have 
been  shown  to  varying  degrees  to  be  effective  at  this  task. 

The  single  most  effective  way  to  improve  retrieval  performance  in  the 
vector  space  model  is  the  use  of  relevance  feedback  (Rocchio,  1971).  In 
this  method,  a user  presents  a query  to  the  system  and  is  presented  with  a 
small  set  of  retrieved  documents.  The  user  is  then  asked  to  specify  which 
of these documents appear relevant to their need. The user’s original query
is  then  reformulated  based  on  the  distribution  of  terms  in  the  relevant  and 
non-relevant  documents  that  the  user  examined.  This  reformulated  query  is 
then  passed  to  the  system  as  a new  query  with  the  new  results  being  shown  to 
the  user.  Typically  an  enormous  improvement  is  seen  after  a single  iteration 
of  this  technique. 

The  formal  basis  for  the  implementation  of  this  technique  falls  out  di- 
rectly from  some  of  the  basic  geometric  intuitions  of  the  vector  model.  In 
particular,  we  would  like  to  push  the  vector  representing  the  user’s  origi- 
nal query  toward  the  documents  that  have  been  found  to  be  relevant,  and 
away  from  the  documents  judged  not  relevant.  This  can  be  accomplished  by 
adding an averaged vector representing the relevant documents to the
original query, and subtracting an averaged vector representing the
non-relevant documents.

More formally, let’s assume that \vec{q}_i represents the user’s original
query, R is the number of relevant documents returned from the original
query, and N is the number of non-relevant documents. In addition, assume
that β and γ range from 0 to 1 and that β + γ = 1. Given these assumptions,
the following represents a standard relevance feedback update formula, where
\vec{r}_j and \vec{s}_k denote the vectors for the relevant and non-relevant
documents, respectively.

\[ \vec{q}_{i+1} = \vec{q}_i + \frac{\beta}{R} \sum_{j=1}^{R} \vec{r}_j - \frac{\gamma}{N} \sum_{k=1}^{N} \vec{s}_k \]

The factors β and γ in this formula represent parameters that can
be adjusted experimentally. Intuitively, they represent how far the original
vector should be pushed towards the relevant documents or away from the
non-relevant ones. Salton and Buckley (1990) report good results with β =
.75 and γ = .25.
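
The relevance feedback update described above, adding a weighted average of the relevant document vectors to the query and subtracting a weighted average of the non-relevant ones, can be sketched as follows. The sample vectors and the function names are hypothetical; the default β and γ are the values reported above. The sketch leaves any negative weights in the new query as they are, which is a simplification.

```python
def mean_vector(vectors, size):
    """Component-wise average of a list of vectors (zeros if the list is empty)."""
    if not vectors:
        return [0.0] * size
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


def feedback_update(query, relevant, non_relevant, beta=0.75, gamma=0.25):
    """Move the query toward the relevant mean and away from the non-relevant mean."""
    size = len(query)
    r_mean = mean_vector(relevant, size)
    s_mean = mean_vector(non_relevant, size)
    return [q + beta * r - gamma * s for q, r, s in zip(query, r_mean, s_mean)]


# Hypothetical vectors over (speech, language, processing).
query = [1.0, 0.0, 1.0]
relevant = [[6.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
non_relevant = [[0.0, 4.0, 0.0]]
new_query = feedback_update(query, relevant, non_relevant)
```

Here the speech dimension grows because both relevant documents use it heavily, while the language dimension is pushed below zero by the single non-relevant document.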

We  should  note  that  evaluating  systems  that  use  relevance  feedback  is 
rather  tricky.  In  particular,  an  enormous  improvement  is  often  seen  in  the 
documents  retrieved  by  the  first  reformulated  query.  This  should  not  be  too 
surprising  since  it  includes  the  documents  that  the  user  has  told  the  system 
were  relevant.  The  preferred  way  to  avoid  this  inflation  is  to  only  compute 
recall  and  precision  measures  for  what  is  called  the  residual  collection,  the 
original  collection  without  any  of  the  documents  shown  to  the  user  on  any 
previous  round.  This  usually  has  the  effect  of  driving  the  system's  raw  per- 
formance below  that  achieved  with  the  first  query,  since  the  most  highly  rele- 
vant documents  have  now  been  eliminated.  Nevertheless,  this  is  an  effective 
technique  to  use  when  comparing  distinct  relevance  feedback  mechanisms. 

An  alternative  approach  to  query  improvement  focuses  on  the  terms 
that  comprise  the  query  vector,  rather  than  the  query  vector  itself.  In  query 
expansion, the user’s original query is expanded to include terms related to
the original terms. This has typically been accomplished by adding
terms chosen from lists of terms that are highly correlated with the user’s
original terms in the collection. Such highly correlated terms are listed in
what  is  typically  called  a thesaurus,  although  since  it  is  based  on  correlation, 
rather  than  synonymy,  it  is  only  loosely  connected  to  the  standard  references 
that  carry  the  same  name. 

Unfortunately,  it  is  usually  the  case  that  available  thesaurus-like  re- 
sources are  not  suitable  for  most  collections.  In  thesaurus  generation,  a 
correlation-based  thesaurus  is  generated  automatically  from  all  or  a portion 
of  the  documents  in  the  collection.  Not  surprisingly,  one  of  the  most  popular 
methods  used  in  thesaurus  generation  involves  the  use  of  term  clustering. 
Recall,  from  our  characterization  of  the  term-by-document  matrix  that  the 
columns  in  the  matrix  represent  the  documents  and  the  rows  represent  the 
terms.  Therefore,  in  thesaurus  generation,  the  rows  can  be  clustered  to  form 
sets  of  synonyms,  which  can  then  be  added  to  the  user’s  original  query  to 
improve  its  recall. 
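
A rough sketch of this idea treats each row of the term-by-document matrix as a vector and adds to the query any term whose row is sufficiently similar to the row of a query term. A fixed similarity threshold is used here as a simple stand-in for a real clustering algorithm, and the matrix, the threshold of 0.8, and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def related_terms(term, rows, threshold=0.8):
    """Return other terms whose rows are similar to the row for term."""
    return sorted(other for other in rows
                  if other != term and cosine(rows[term], rows[other]) >= threshold)


def expand_query(query_terms, rows, threshold=0.8):
    """Add correlated terms to the query, keeping the original terms first."""
    expanded = list(query_terms)
    for term in query_terms:
        if term not in rows:
            continue  # no row to compare for terms outside the collection
        for other in related_terms(term, rows, threshold):
            if other not in expanded:
                expanded.append(other)
    return expanded


# Hypothetical term-by-document matrix: one row of counts per term,
# one column per document.
rows = {
    "dog":      [3, 0, 2, 0],
    "canine":   [2, 0, 3, 0],
    "stock":    [0, 4, 0, 1],
    "malamute": [1, 0, 1, 0],
}
expanded = expand_query(["dog"], rows)
```

Terms are grouped because they occur in the same documents, not because they mean the same thing, which is why such a resource is only loosely a thesaurus.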

This  technique  is  typically  instantiated  in  one  of  two  ways:  a thesaurus 
can  be  generated  once  from  the  document  collection  as  a whole  (Crouch  and 
Yang,  1992),  or  sets  of  synonym-like  terms  can  be  generated  dynamically 
from  the  returned  set  for  the  original  query  (Attar  and  Fraenkel,  1977).  Note 
that this second approach entails far more effort, since in effect a small
thesaurus is generated for the documents returned for every query, rather
than once for the entire collection.

17.4  Other Information Retrieval Tasks


As  noted  earlier,  ad-hoc  retrieval  is  not  the  only  word-based  task  in  infor- 
mation retrieval.  Some  of  the  other  more  important  ones  include  document 
categorization,  document  clustering,  and  text  segmentation. 

The  categorization  task  is  to  assign  a new  document  to  one  of  a pre- 
existing set  of  document  classes.  In  this  setting,  the  task  of  creating  a clas- 
sifier consists  of  discovering  a useful  characterization  of  the  documents  that 
belong  in  each  class.  Although  this  can  be  done  by  hand,  the  principal  way 
to  approach  this  problem  is  to  use  supervised  machine  learning.  In  particu- 
lar, classifiers  can  be  trained  on  a set  of  documents  that  have  been  labeled 
with  the  correct  class.  Not  surprisingly,  all  the  supervised  learning  methods 
introduced  on  page  634  for  word  sense  disambiguation  can  be  applied  to  this 
task  as  well. 

When  categorization  is  performed  with  the  intent  of  then  transmitting 
the  document  to  a user  or  set  of  interested  users  it  is  usually  referred  to  as 
filtering or routing. An interesting example of this is AT&T’s ‘How May
I Help You’ task where the goal is to classify a user’s utterance into one
of fifteen possible categories, such as third number billing, or collect call.
Once the system has classified the call, the system routes the caller to an
appropriate human operator. This task provides a good example of the need
for in vivo evaluation mentioned earlier. The classification accuracy on this
task approaches 80%, despite the fact that the speech recognizer has a word
accuracy rate of only around 50% (Gorin et al., 1997).

The  categorization  task  assumes  an  existing  classification,  or  cluster- 
ing, of  documents.  By  contrast,  the  task  of  document  clustering  is  to  create, 
or discover, a reasonable set of clusters for a given set of documents. As
was the case with word sense discovery, a reasonable cluster is defined as
one that maximizes the within-cluster document similarity, and minimizes
between-cluster similarity. There are two principal motivations for the use
of this technique in an ad hoc retrieval setting: efficiency, and the
cluster hypothesis.

The  efficiency  motivation  arises  from  the  enormous  size  of  many  mod- 
ern document  collections.  Recall  that  the  retrieval  method  described  in  the 
last  section  requires  every  query  to  be  compared  against  every  document  in 
the  collection.  If  a collection  can  be  divided  up  into  a set  of  N conceptually 
coherent clusters, then queries could first be compared against
representations of each of the N clusters. Ordinary retrieval could then be
applied only within the top cluster or clusters, thus saving the cost of
comparing the query to the documents in all of the other more distant
clusters.
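
This two-stage procedure can be sketched as follows, assuming the clusters have already been computed. Using the centroid (the component-wise mean) as the representation of a cluster is an assumption of this sketch, as are the sample vectors and the function names.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean, used here as the cluster representation."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_retrieve(query, clusters):
    """Pick the cluster whose centroid is closest, then rank only its documents."""
    best = max(clusters, key=lambda c: cosine(query, centroid(list(c.values()))))
    return sorted(best, key=lambda name: cosine(query, best[name]), reverse=True)


# Hypothetical pre-computed clusters of document vectors.
clusters = [
    {"a1": [5, 0, 1], "a2": [6, 1, 0]},
    {"b1": [0, 5, 1], "b2": [1, 4, 0]},
]
results = cluster_retrieve([1, 0, 1], clusters)
```

The query is compared against two centroids and then against the two documents of the winning cluster, rather than against all four documents; the saving grows with the size of the collection.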

The  cluster  hypothesis  (Jardine  and  van  Rijsbergen,  1971)  takes  this 
argument  a step  further  by  asserting  that  retrieval  from  a clustered  collection 
will  not  only  be  more  efficient,  but  will  in  fact  improve  retrieval  performance 
in  terms  of  recall  and  precision.  The  basic  notion  behind  this  hypothesis  is 
that  by  separating  documents  according  to  topic,  relevant  documents  will 
be  found  together  in  the  same  cluster,  and  non-relevant  documents  will  be 
avoided since they will reside in clusters that are not used for retrieval.
Despite the plausibility of this hypothesis, there is only mixed experimental
support for it. Results vary considerably based on the clustering algorithm
and document collection in use (Willett, 1988; Shaw et al., 1996).

Finally,  in  text  segmentation,  larger  documents  arc  automatically  bro- 
ken down  into  smaller  semantically  coherent  chunks.  This  is  useful  in  do- 
mains where  there  arc  a significant  number  of  large  documents  that  cover 
a wide  variety  of  topics.  Text  segmentation  can  be  used  to  either  perform 
retrieval  below  the  document  level,  or  to  visually  guide  the  user  to  relevant 
parts  of  retrieved  documents.  Again,  not  surprisingly,  segmentation  algo- 
rithms often  make  use  of  vector-like  representations  for  the  subparts  of  a 
larger  document.  Adjacent  subparts  that  have  similar  cosines  arc  more  likely 
to  about  the  same  topic  than  adjacent  segments  with  more  distant  cosines. 
Roughly  speaking,  such  discontinuities  in  the  similarity  between  adjacent 
text  segments  can  be  used  to  divide  larger  documents  into  subparts  (Salton 
et al., 1993; Hearst, 1997).
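
The idea of placing boundaries at dips in the similarity between adjacent
subparts can be sketched in a few lines of Python. This is only a minimal
illustration, not the algorithm of Salton et al. (1993) or Hearst (1997):
the function names, the whitespace tokenization, the raw term counts, and
the fixed threshold of 0.4 are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine of the angle between two bag-of-words count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_boundaries(blocks, threshold=0.4):
    """Return each index i where a boundary falls between blocks[i] and blocks[i + 1].

    A boundary is hypothesized wherever the cosine between the vectors of
    two adjacent blocks drops below the (assumed) threshold.
    """
    vectors = [Counter(block.lower().split()) for block in blocks]
    return [i for i in range(len(vectors) - 1)
            if cosine(vectors[i], vectors[i + 1]) < threshold]


blocks = [
    "the car has a new engine",
    "the engine of the car is loud",
    "stocks fell on the market today",
    "the market and stocks recovered",
]
boundaries = segment_boundaries(blocks)
print(boundaries)
```

In this toy input the only dip below the threshold falls between the second
and third blocks. A realistic system would at least remove function words
and weight the remaining terms, since a shared word like the contributes to
the cosine of every pair of adjacent blocks.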


17.5  Summary 

This  chapter  has  explored  two  major  areas  of  lexical  semantic  processing: 
word  sense  disambiguation  and  information  retrieval. 

• Word  sense  disambiguation  systems  assign  word  tokens  in  context  to 
one  of  a pre-specified  set  of  senses. 

• Selection  restriction-based  approaches  can  be  used  to  disambiguate 
both  predicates  and  arguments. 

• Selection restriction-based methods require considerable information
about semantic role restrictions and hierarchical type information about
role fillers.



• Machine  learning  approaches  to  sense  disambiguation  make  it  possible 
to  automatically  create  robust  sense  disambiguation  systems. 

• Supervised  approaches  use  collections  of  texts  annotated  with  their 
correct  senses  to  train  classifiers. 

• Bootstrapping  approaches  permit  the  use  of  supervised  methods  with 
far  fewer  resources. 

• Unsupervised,  clustering-based,  approaches  attempt  to  discover  repre- 
sentations of  word  senses  from  unannotated  texts. 

• Machine  readable  dictionaries  facilitate  the  creation  of  broad-coverage 
sense  disambiguators. 

• The  dominant  models  of  information  retrieval  represent  the  meanings 
of  documents  and  queries  as  bags  of  words. 

• The  vector  space  model  views  documents  and  queries  as  vectors  in  a 
large  multidimensional  space. 

• The  similarity  between  documents  and  queries,  or  other  documents, 
can  be  measured  by  the  cosine  of  the  angle  between  the  vectors. 

• The values of the features of vectors are based on a combination of the
frequency of terms within a document and the distribution of terms
across the collection.

• Polysemy  and  synonymy  wreak  havoc  with  word-based  information 
retrieval  systems,  reducing  both  precision  and  recall. 

• User  queries  can  be  improved  through  query  reformulation  using  either 
relevance  feedback  or  thesaurus-based  query  expansion. 


Bibliographical  and  Historical  Notes 

Word  sense  disambiguation  traces  its  roots  to  some  of  the  earliest  applica- 
tions of  digital  computers.  The  notion  of  disambiguating  a word  by  looking 
at a small window around it was apparently first suggested by Warren Weaver
(1955b),  in  the  context  of  machine  translation.  Among  the  notions  first  pro- 
posed in  this  early  period  were  the  use  of  a thesaurus  for  disambiguation 
(Masterman,  1957),  supervised  training  of  Bayesian  models  for  disambigua- 
tion (Madhu  and  Lytel,  1965),  and  the  use  of  clustering  in  word  sense  anal- 
ysis (Sparck  Jones,  1986). 

An  enormous  amount  of  work  on  disambiguation  has  been  conducted 
within the context of AI-oriented natural language processing systems. It is
fair to say that most natural language analysis systems of this type exhibit
some  form  of  lexical  disambiguation  capability.  However,  a number  of  these 
efforts  made  word  sense  disambiguation  a larger  focus  of  their  work.  Among 
the  most  influential  efforts  were  the  efforts  of  Quillian  (1968)  and  Simmons 
(1973b) with semantic networks, the work of Wilks with Preference Seman-
tics (Wilks, 1975c, 1975b, 1975a), and the work of Small and Rieger
(1982)  and  Riesbeck  (1975)  on  word-based  understanding  systems.  Hirst’s 
ABSITY  system  (Hirst  and  Charniak,  1982;  Hirst,  1986,  1988),  which  used 
a technique  based  on  semantic  networks  called  marker  passing,  represents 
the  most  advanced  system  of  this  type.  As  with  these  largely  symbolic  ap- 
proaches, most  connectionist  approaches  to  word  sense  disambiguation  have 
relied  on  small  lexicons  with  hand-coded  representations  (Cottrell,  1985; 
Kawamoto,  1988). 

We  should  note  that  considerable  work  on  sense  disambiguation  has 
been  conducted  in  the  areas  of  Cognitive  Science  and  psycholinguistics.  Ap- 
propriately enough,  it  is  generally  described  using  a different  name:  lexical 
ambiguity  resolution.  Small  et  al.  (1988)  present  a variety  of  papers  from 
this  perspective. 

The  earliest  implementation  of  a robust  empirical  approach  to  sense 
disambiguation is due to Kelly and Stone (1975), who directed a team
that hand-crafted a set of disambiguation rules for 1790 ambiguous English
words.  Lesk  (1986)  was  the  first  to  use  a machine  readable  dictionary  for 
word  sense  disambiguation.  The  efforts  at  New  Mexico  State  University 
using LDOCE are among the most extensive explorations of the use of ma-
chine readable dictionaries. Much of this work is described in (Wilks et al.,
1996). The problem of dictionary senses being too fine-grained or lacking
an  appropriate  organization  has  been  addressed  in  the  work  of  (Dolan,  1994) 
and  (Chen  and  Chang,  1998). 

Modern  interest  in  supervised  machine  learning  approaches  to  disam- 
biguation began  with  Black  (1988),  who  applied  decision  tree  learning  to  the 
task.  The  need  for  large  amounts  of  annotated  text  in  these  methods  led  to  in- 
vestigations into  the  use  of  bootstrapping  methods  (Hearst,  1991;  Yarowsky, 
1995).  The  problem  of  how  to  weight  and  combine  the  disparate  sources  of 
evidence  used  in  many  robust  systems  is  explored  in  (Ng  and  Lee,  1996)  and 
(McRoy, 1992). There has been considerably less work in the area of unsu-
pervised methods. The earliest attempt to use clustering in the study
of  word  senses  is  due  to  (Sparck  Jones,  1986).  Zernik  (1991)  successfully 
applied  a standard  information  retrieval  clustering  algorithm  to  the  problem, 
and  provided  an  evaluation  based  on  improvements  in  retrieval  performance. 
More extensive recent work on clustering can be found in (Pedersen and
Bruce, 1997; Schütze, 1997, 1998).

Note  that  of  all  of  these  robust  efforts,  only  three  have  attempted  to  ex- 
ploit the  power  of  mutually  disambiguating  all  the  words  in  a sentence.  The 
system  described  in  (Kelly  and  Stone,  1975)  makes  multiple  passes  over  a 
sentence to take later advantage of easily disambiguated words; Cowie et al.
(1992)  use  a simulated  annealing  model  to  perform  a parallel  search  for  a 
desirable  set  of  senses;  Veronis  and  Ide  (1990)  use  inhibition  and  excita- 
tion in  a neural  network  automatically  constructed  from  a machine  readable 
dictionary. 

Ide  and  Veronis  (1998)  provide  a comprehensive  review  of  the  history 
and  current  state  of  word  sense  disambiguation.  (Ng  and  Zelle,  1997)  pro- 
vide a more  focused  review  from  a machine  learning  perspective.  Wilks 
et  al.  (1996)  describe  a wide  array  of  dictionary  and  corpus-based  experi- 
ments, along  with  detailed  descriptions  of  some  very  early  work. 

Luhn  (1957)  is  generally  credited  with  first  advancing  the  notion  of 
fully  automatic  indexing  of  documents  based  on  their  contents.  Over  the 
years  Salton’s  SMART  project  (Salton,  1971)  at  Cornell  developed  or  eval- 
uated many  of  the  most  important  notions  in  information  retrieval  including 
the  vector  model,  term  weighting  schemes,  relevance  feedback,  and  the  use 
of  cosine  as  a similarity  metric.  The  notion  of  using  inverse  document  fre- 
quency in  term  weighting  is  due  to  (Sparck  Jones,  1972).  The  original  notion 
of  relevance  feedback  is  due  to  (Rocchio,  1971).  An  alternative  to  the  vec- 
tor model  that  we  have  not  covered  is  the  probabilistic  model.  Originally 
shown effective by Robertson and Sparck Jones (1976), a Bayesian network
version  of  the  probabilistic  model  is  the  basis  for  the  widely  used  INQUERY 
system  (Callan  et  al.,  1992). 

The  cluster  hypothesis  was  introduced  in  (Jardine  and  van  Rijsber- 
gen,  1971).  Willett  (1988)  provides  a critical  review  of  the  major  efforts 
in  this  area.  Mather  (1998)  presents  an  algorithm-independent  clustering 
metric  that  can  be  used  to  evaluate  the  performance  of  various  clustering  al- 
gorithms. A collection  of  papers  on  document  categorization  and  its  close 
siblings,  filtering  and  routing,  can  be  found  in  (Lewis  and  Hayes,  1994).  Text 
segmentation  has  generally  been  investigated  from  one  of  two  perspectives: 
approaches  based  on  strong  theories  of  discourse  structure,  and  approaches 
based  on  lexical  text  cohesion  (Morris  and  Hirst,  1991).  Hearst  (1997)  de- 
scribes a robust  technique  based  on  a vector  model  of  lexical  cohesion.  Tech- 
niques based  on  strong  discourse-models  are  discussed  in  Chapter  18  and 
Chapter  20. 
An important extension of the vector space model known as Latent
Semantic Indexing (LSI) (Deerwester et al., 1990) uses the singular value
decomposition method as a means of reducing the dimensionality of vector
models  with  the  intent  of  discovering  higher-order  regularities  in  the  original 
term-by-document  matrix.  Although  LSI  began  life  as  a retrieval  method,  it 
has  been  applied  to  a wide  variety  of  applications  including  models  of  lexical 
acquisition  (Landauer  and  Dumais,  1997),  question  answering  (Jones,  1997), 
and most recently, student essay grading (Landauer et al., 1997).

Baeza-Yates and Ribeiro-Neto (1999) is a comprehensive text cover-
ing many of the newest advances and trends in information retrieval. Frakes
and  Baeza-Yates  (1992)  is  a more  nuts  and  bolts  text  which  includes  a con- 
siderable amount  of  useful  C code.  Older  classic  texts  include  (Salton  and 
McGill,  1983)  and  (van  Rijsbergen,  1975).  (Sparck  Jones  and  Willett,  1997) 
includes  many  of  the  classic  papers  in  the  field.  Current  work  is  often  pub- 
lished in  the  annual  proceedings  of  the  ACM  Special  Interest  Group  on  In- 
formation Retrieval  (SIGIR).  The  periodic  TREC  conference  proceedings 
contain results from standardized evaluations organized by the U.S. govern-
ment. The primary journals in the field are the Journal of the American
Society  of  Information  Sciences,  ACM  Transactions  on  Information  Systems, 
Information  Processing  and  Management,  and  Information  Retrieval. 
PRAGMATICS


Pragmatics  is  the  study  of  (some  parts  of)  the  relation  between 
language  and  context-of-use.  Context-of-use  includes  such  things  as 
the  identities  of  people  and  objects,  and  so  pragmatics  includes  stud- 
ies of  how  language  is  used  to  refer  (and  re-refer)  to  people  and  things. 
Context-of-use  includes  the  discourse  context,  and  so  pragmatics  in- 
cludes studies  of  how  discourses  are  structured,  and  how  the  listener 
manages  to  interpret  a conversational  partner  in  a conversation.  This 
section  explores  algorithms  for  reference  resolution,  computational 
models  for  recovering  the  structure  of  monologue  and  conversational 
discourse,  and  models  of  how  utterances  in  dialog  are  interpreted. 
This  section  also  discusses  the  role  of  each  of  these  models  in  build- 
ing a conversational  agent,  as  well  as  the  design  of  the  dialog  manager 
component  of  such  an  agent.  Finally,  the  section  introduces  natural 
language  generation,  focusing  especially  on  the  function  of  discourse. 


DISCOURSE* 


Gracie:  Oh  yeah...  And  then  Mr.  and  Mrs.  Jones  were  having 
matrimonial  trouble,  and  my  brother  was  hired  to  watch  Mrs. 
Jones. 

George:  Well,  I imagine  she  was  a very  attractive  woman. 
Gracie:  She  was,  and  my  brother  watched  her  day  and  night  for 
six  months. 

George:  Well,  what  happened? 

Gracie:  She  finally  got  a divorce. 

George:  Mrs.  Jones? 

Gracie:  No,  my  brother’s  wife. 

George  Burns  and  Gracie  Allen  in  The  Salesgirl 


Up  to  this  point  of  the  book,  we  have  focused  primarily  on  language 
phenomena  that  operate  at  the  word  or  sentence  level.  Of  course,  language 
does  not  normally  consist  of  isolated,  unrelated  sentences,  but  instead  of 
collocated,  related  groups  of  sentences.  We  refer  to  such  a group  of  sentences 
as  a discourse. 

The chapter you are now reading is an example of a discourse. It is in
fact a discourse of a particular sort: a monologue. Monologues are charac-
terized by a speaker (a term which will be used to include writers, as it is
here),  and  a hearer  (which,  analogously,  includes  readers).  The  communi- 
cation flows  in  only  one  direction  in  a monologue,  that  is,  from  the  speaker 
to  the  hearer. 

After  reading  this  chapter,  you  may  have  a conversation  with  a friend 
about  it,  which  would  consist  of  a much  freer  interchange.  Such  a discourse 
is  called  a dialogue.  In  this  case,  each  participant  periodically  takes  turns 
being a speaker and hearer. Unlike a typical monologue, dialogues generally
consist  of  many  different  types  of  communicative  acts:  asking  questions, 
giving  answers,  making  corrections,  and  so  forth. 

Finally, computer systems exist and continue to be developed that al-
low for human-computer interaction, or HCI. HCI has properties that distin-
guish it from normal human-human dialogue, in part due to the present-day
limitations  on  the  ability  of  computer  systems  to  participate  in  free,  uncon- 
strained conversation.  A system  capable  of  HCI  will  often  employ  a strategy 
to  constrain  the  conversation  in  ways  that  allow  it  to  understand  the  user’s 
utterances  within  a limited  context  of  interpretation. 

While many discourse processing problems are common to these three
forms  of  discourse,  they  differ  in  enough  respects  that  different  techniques 
have  often  been  used  to  process  them.  This  chapter  focuses  on  techniques 
commonly  applied  to  the  interpretation  of  monologues;  techniques  for  dia- 
logue interpretation  and  HCI  will  be  described  in  Chapter  19. 

Language  is  rife  with  phenomena  that  operate  at  the  discourse  level. 
Consider  the  discourse  shown  in  example  (18.1). 

(18.1) John went to Bill’s car dealership to check out an Acura Integra. He
looked  at  it  for  about  an  hour. 

What  do  pronouns  such  as  he  and  it  denote?  No  doubt  that  the  reader  had 
little  trouble  figuring  out  that  he  denotes  John  and  not  Bill,  and  that  it  denotes 
the  Integra  and  not  Bill’s  car  dealership.  On  the  other  hand,  toward  the  end 
of the exchange presented at the beginning of this chapter, it appears that
George  had  some  trouble  figuring  out  who  Gracie  meant  when  saying  she. 

What  differentiates  these  two  examples?  How  do  hearers  interpret  dis- 
course (18.1)  with  such  ease?  Can  we  build  a computational  model  of  this 
process?  These  are  the  types  of  questions  we  address  in  this  chapter.  In  Sec- 
tion 18.1,  we  describe  methods  for  interpreting  referring  expressions  such  as 
pronouns.  We  then  address  the  problem  of  establishing  the  coherence  of  a 
discourse  in  Section  18.2.  Finally,  in  Section  18.3  we  explain  methods  for 
determining  the  structure  of  a discourse. 

Because  discourse-level  phenomena  are  ubiquitous  in  language,  algo- 
rithms for  resolving  them  are  essential  for  a wide  range  of  language  appli- 
cations. For  instance,  interactions  with  query  interfaces  and  dialogue  inter- 
pretation systems  like  ATIS  (see  Chapter  9)  frequently  contain  pronouns  and 
similar  types  of  expressions.  So  when  a user  spoke  passage  (18.2)  to  an  ATIS 
system,

(18.2) I’d like to get from Boston to San Francisco, on either December 5th
or  December  6th.  It's  okay  if  it  stops  in  another  city  along  the  way. 

the  system  had  to  figure  out  that  it  denotes  the  flight  that  the  user  wants  to 
book  in  order  to  perform  the  appropriate  action. 

Similarly,  information  extraction  systems  (see  Chapter  15)  must  fre- 
quently extract  information  from  utterances  that  contain  pronouns.  For  in- 
stance, if  an  information  extraction  system  is  confronted  with  passage  (18.3), 

(18.3)  First  Union  Corp  is  continuing  to  wrestle  with  severe  problems 
unleashed  by  a botched  merger  and  a troubled  business  strategy. 

According  to  industry  insiders  at  Paine  Webber,  their  president,  John 
R.  Georgius,  is  planning  to  retire  by  the  end  of  the  year. 

it  must  correctly  identify  First  Union  Corp  as  the  denotation  of  their  (as 
opposed  to  Paine  Webber,  for  instance)  in  order  to  extract  the  correct  event. 

Likewise,  many  text  summarization  systems  employ  a procedure  for 
selecting  the  important  sentences  from  a source  document  and  using  them 
to  form  a summary.  Consider,  for  example,  a news  article  that  contains  pas- 
sage (18.3).  Such  a system  might  determine  that  the  second  sentence  is 
important  enough  to  be  included  in  the  summary,  but  not  the  first.  How- 
ever, the  second  sentence  contains  a pronoun  that  is  dependent  on  the  first 
sentence,  so  it  cannot  place  the  second  sentence  in  the  summary  without  first 
determining  the  pronoun’s  denotation,  as  the  pronoun  would  otherwise  likely 
receive  a different  interpretation  within  the  summary.  Similarly,  natural  lan- 
guage generation  systems  (see  Chapter  20)  must  have  adequate  models  for 
pronominalization  to  produce  coherent  and  interpretable  discourse.  In  short, 
just  about  any  conceivable  language  processing  application  requires  methods 
for  determining  the  denotations  of  pronouns  and  related  expressions. 

18.1  Reference  Resolution 

In this section we study the problem of reference, the process by which
speakers use expressions like John and he in passage (18.1) to denote a per-
son named John. Our discussion requires that we first define some termi-
nology. A natural language expression used to perform reference is called a
referring expression, and the entity that is referred to is called the referent.
Thus, John and he in passage (18.1) are referring expressions, and John is
their referent. (To distinguish between referring expressions and their refer-
ents, we italicize the former.) As a convenient shorthand, we will sometimes
speak of a referring expression referring to a referent, e.g., we might say
that  he  refers  to  John.  However,  the  reader  should  keep  in  mind  that  what 
we  really  mean  is  that  the  speaker  is  performing  the  act  of  referring  to  John 
by uttering he. Two referring expressions that are used to refer to the same
entity are said to corefer; thus John and he corefer in passage (18.1). There
is also a term for a referring expression that licenses the use of another, in
the way that the mention of John allows John to be subsequently referred to
using he. We call John the antecedent of he. Reference to an entity that
has been previously introduced into the discourse is called anaphora, and
the referring expression used is said to be anaphoric. In passage (18.1), the
pronouns he and it are therefore anaphoric.

Natural  languages  provide  speakers  with  a variety  of  ways  to  refer  to 
entities.  Say  that  your  friend  has  an  Acura  Integra  automobile  and  you  want 
to refer to it. Depending on the operative discourse context, you might
say it, this, that, this car, that car, the car, the Acura, the Integra, or my
friend’s car, among many other possibilities. However, you are not free to
choose between any of these alternatives in any context. For instance, you
cannot simply say it or the Acura if the hearer has no prior knowledge of your
friend's car, it has not been mentioned before, and it is not in the immediate
surroundings of the discourse participants (i.e., the situational context of
the  discourse). 

The  reason  for  this  is  that  each  type  of  referring  expression  encodes  dif- 
ferent signals  about  the  place  that  the  speaker  believes  the  referent  occupies 
within the hearer’s set of beliefs. A subset of these beliefs that has a spe-
cial status forms the hearer’s mental model of the ongoing discourse, which
we call a discourse model (Webber, 1978). The discourse model contains
representations of the entities that have been referred to in the discourse and
the relationships in which they participate. Thus, there are two components
required  by  a system  to  successfully  produce  and  interpret  referring  expres- 
sions: a method  for  constructing  a discourse  model  that  evolves  with  the 
dynamically-changing  discourse  it  represents,  and  a method  for  mapping  be- 
tween the  signals  that  various  referring  expressions  encode  and  the  hearer’s 
set  of  beliefs,  the  latter  of  which  includes  this  discourse  model. 

We  will  speak  in  terms  of  two  fundamental  operations  to  the  discourse 
model. When a referent is first mentioned in a discourse, we say that a rep-
resentation for it is evoked into the model. Upon subsequent mention, this
representation is accessed from the model. The operations and relationships
are illustrated in Figure 18.1.
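
The two operations can be made concrete with a small sketch. The class
below is a toy of our own devising, not a representation proposed in the
literature: the name DiscourseModel, the dictionary layout of an entity, and
the mention counter are all hypothetical, and the hard part of the problem
(deciding which representation a given expression should access) is simply
assumed to have been solved.

```python
class DiscourseModel:
    """Toy discourse model: a store of entity representations."""

    def __init__(self):
        # Representations are kept in order of first mention.
        self.entities = []

    def evoke(self, description):
        """First mention: add a new representation to the model."""
        entity = {
            "id": len(self.entities),
            "description": description,
            "mentions": 1,
        }
        self.entities.append(entity)
        return entity

    def access(self, entity_id):
        """Subsequent mention: retrieve an existing representation."""
        entity = self.entities[entity_id]
        entity["mentions"] += 1
        return entity


model = DiscourseModel()
integra = model.evoke("a new Acura Integra")  # the indefinite NP evokes
same = model.access(integra["id"])            # a later "it" accesses
print(same["description"], same["mentions"])
```

Because access returns the very representation that evoke created, the two
expressions corefer in the sense defined above: both are tied to a single
entity in the model.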

Figure 18.1  Reference operations and relationships: a new Acura Integra
and it corefer.

We will restrict our discussion to reference to entities, although dis-
courses include reference to many other types of referents. Consider the
possibilities in example (18.4), adapted from Webber (1991).

(18.4)  According  to  John,  Bob  bought  Sue  an  Integra,  and  Sue  bought  Fred 
a Legend. 

a.  But  that  turned  out  to  be  a lie. 

b.  But  that  was  false. 

c.  That  struck  me  as  a funny  way  to  describe  the  situation. 

d.  That  caused  Sue  to  become  rather  poor. 

e.  That  caused  them  both  to  become  rather  poor. 

The  referent  of  that  is  a speech  act  (see  Chapter  19)  in  (18.4a),  a proposition 
in  (18.4b),  a manner  of  description  in  (18.4c),  an  event  in  (18.4d),  and  a 
combination  of  several  events  in  (18.4e).  The  field  awaits  the  development 
of  robust  methods  for  interpreting  these  types  of  reference. 

Reference  Phenomena 

The  set  of  referential  phenomena  that  natural  languages  provide  is  quite  rich 
indeed.  In  this  section,  we  provide  a brief  description  of  several  basic  ref- 
erence phenomena.  We  first  survey  five  types  of  referring  expression:  in- 
definite noun  phrases,  definite  noun  phrases,  pronouns,  demonstratives,  and 
one-anaphora.  We  then  describe  three  types  of  referents  that  complicate  the 
reference  resolution  problem:  inferrables,  discontinuous  sets,  and  generics. 

Indefinite Noun Phrases  Indefinite reference introduces entities that are
new to the hearer into the discourse context. The most common form of
indefinite reference is marked with the determiner a (or an), as in (18.5),
but  it  can  also  be  marked  by  a quantifier  such  as  some  (18.6)  or  even  the 
determiner  this  (18.7). 

(18.5)  I saw  an  Acura  Integra  today. 

(18.6)  Some  Acura  Integras  were  being  unloaded  at  the  local  dealership 
today. 

(18.7)  I saw  this  awesome  Acura  Integra  today. 

Such  noun  phrases  evoke  a representation  for  a new  entity  that  satisfies  the 
given  description  into  the  discourse  model. 

The indefinite determiner a does not indicate whether the entity is iden-
tifiable to the speaker, which in some cases leads to a specific/non-specific
ambiguity. Example (18.5) only has the specific reading, since the speaker
has a particular Integra in mind, namely the one she saw. In sentence
(18.8), on the other hand, both readings are possible.

(18.8)  I am  going  to  the  dealership  to  buy  an  Acura  Integra  today. 

That  is,  the  speaker  may  already  have  the  Integra  picked  out  (specific),  or 
may  just  be  planning  to  pick  one  out  that  is  to  her  liking  (nonspecific).  The 
readings  may  be  disambiguated  by  a subsequent  referring  expression  in  some 
contexts;  if  this  expression  is  definite  then  the  reading  is  specific  (I  hope 
they still have it), and if it is indefinite then the reading is nonspecific (I
hope  they  have  a car  I like).  This  rule  has  exceptions,  however;  for  instance 
definite  expressions  in  certain  modal  contexts  (I  will  park  it  in  my  garage) 
are  compatible  with  the  nonspecific  reading. 

Definite  Noun  Phrases  Definite  reference  is  used  to  refer  to  an  entity  that 
is  identifiable  to  the  hearer,  either  because  it  has  already  been  mentioned  in 
the  discourse  context  (and  thus  is  represented  in  the  discourse  model),  it  is 
contained  in  the  hearer’s  set  of  beliefs  about  the  world,  or  the  uniqueness  of 
the  object  is  implied  by  the  description  itself. 

The  case  in  which  the  referent  is  identifiable  from  discourse  context  is 
shown  in  (18.9). 

(18.9)  I saw  an  Acura  Integra  today.  The  Integra  was  white  and  needed  to 
be  washed. 

Examples  in  which  the  referent  is  either  identifiable  from  the  hearer’s 
set  of  beliefs  or  is  inherently  unique  are  shown  in  (18.10)  and  (18.11)  re- 
spectively. 

(18.10) The Indianapolis 500 is the most popular car race in the US.

(18.11) The fastest car in the Indianapolis 500 was an Integra.

Definite  noun  phrase  reference  requires  that  an  entity  be  accessed  from  either 
the  discourse  model  or  the  hearer’s  set  of  beliefs  about  the  world.  In  the  latter 
case,  it  also  evokes  a representation  of  the  referent  into  the  discourse  model. 

Pronouns  Another  form  of  definite  reference  is  pronominalization,  illus- 
trated in  example  (18.12). 

(18.12) I saw an Acura Integra today. It was white and needed to be
washed. 

The  constraints  on  using  pronominal  reference  are  stronger  than  for  full  defi- 
nite noun  phrases,  requiring  that  the  referent  have  a high  degree  of  activation 
or salience in the discourse model. Pronouns usually (but not always) refer
to  entities  that  were  introduced  no  further  than  one  or  two  sentences  back  in 
the  ongoing  discourse,  whereas  definite  noun  phrases  can  often  refer  further 
back.  This  is  illustrated  by  the  difference  between  sentences  (18.13d)  and 
(18.13d'). 

(18.13) a. John went to Bob’s party, and parked next to a beautiful Acura
Integra. 

b.  He  went  inside  and  talked  to  Bob  for  more  than  an  hour. 

c.  Bob  told  him  that  he  recently  got  engaged. 

d.  ??  He  also  said  that  he  bought  it  yesterday. 

d'. He also said that he bought the Acura yesterday.

By  the  time  the  last  sentence  is  reached,  the  Integra  no  longer  has  the  degree 
of  salience  required  to  allow  for  pronominal  reference  to  it. 
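
A very crude stand-in for this salience requirement is a recency window on
pronoun antecedents, as in the sketch below. The function name, the
two-sentence default, and the hand-built table of most recent mentions are
assumptions made for illustration. As noted above, pronouns only usually
obey such a limit, so a fixed window is an approximation rather than a
rule.

```python
def pronoun_candidates(last_mention, current_sentence, window=2):
    """Entities whose most recent mention is at most `window` sentences back.

    `last_mention` maps each entity to the index of the sentence in which
    it was most recently mentioned.
    """
    return [entity for entity, sentence in last_mention.items()
            if current_sentence - sentence <= window]


# Most recent mentions after (18.13a-c), with sentence a as index 0.
# John and Bob are both mentioned in sentence c; the Integra only in a.
last_mention = {"John": 2, "Bob": 2, "Integra": 0}

# Sentence d has index 3.
candidates = pronoun_candidates(last_mention, current_sentence=3)
print(candidates)
```

With the default window the Integra is no longer a candidate antecedent for
it in (18.13d). A resolver for full definite noun phrases such as the Acura
would search the whole model instead, which corresponds to calling the
function with a larger window.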

Pronouns can also participate in cataphora, in which they are men-
tioned before their referents are, as in example (18.14).

(18.14)  Before  he  bought  it,  John  checked  over  the  Integra  very  carefully. 

Here, the pronouns he and it both occur before their referents are introduced.

Pronouns also appear in quantified contexts in which they are consid-
ered to be bound, as in example (18.15).

(18.15)  Every  woman  bought  her  Acura  at  the  local  dealership. 

Under  the  relevant  reading,  her  does  not  refer  to  some  woman  in  context, 
but  instead  behaves  like  a variable  bound  to  the  quantified  expression  every 
woman.  We  will  not  be  concerned  with  the  bound  interpretation  of  pronouns 
in  this  chapter. 
Demonstratives  Demonstrative pronouns, like this and that, behave some-
what differently than simple definite pronouns like it. They can appear either
alone  or  as  determiners,  for  instance,  this  Acura,  that  Acura.  The  choice  be- 
tween two  demonstratives  is  generally  associated  with  some  notion  of  spa- 
tial proximity:  this  indicating  closeness  and  that  signaling  distance.  Spatial 
distance  might  be  measured  with  respect  to  the  discourse  participants’  situa- 
tional context,  as  in  (18.16). 

(18.16) [John shows Bob an Acura Integra and a Mazda Miata]

Bob  (pointing):  I like  this  better  than  that. 

Alternatively, distance can be metaphorically interpreted in terms of con-
ceptual relations in the discourse model. For instance, consider example
(18.17).

(18.17) I bought an Integra yesterday. It’s similar to the one I bought five
years  ago.  That  one  was  really  nice,  but  I like  this  one  even  better. 

Here,  that  one  refers  to  the  Acura  bought  five  years  ago  (greater  temporal 
distance),  whereas  this  one  refers  to  the  one  bought  yesterday  (closer  tem- 
poral distance). 

One Anaphora  One-anaphora, exemplified in (18.18), blends properties
of  definite  and  indefinite  reference. 

(18.18)  I saw  no  less  than  6 Acura  Integras  today.  Now  I want  one. 

This  use  of  one  can  be  roughly  paraphrased  by  one  of  them,  in  which 
them  refers  to  a plural  referent  (or  generic  one,  as  in  the  case  of  (18.18),  see 
below),  and  one  selects  a member  from  this  set.  Thus,  one  may  evoke  a new 
entity  into  the  discourse  model,  but  it  is  necessarily  dependent  on  an  existing 
referent  for  the  description  of  this  new  entity. 

This  use  of  one  should  be  distinguished  from  the  formal,  non-specific 
pronoun  usage  in  (18.19),  and  its  meaning  as  the  number  one  in  (18.20). 

(18.19)  One  shouldn’t  pay  more  than  twenty  thousand  dollars  for  an  Acura. 

(18.20)  John  has  two  Acuras,  but  I only  have  one. 

Inferrables  Having described several types of referring expressions, we now
turn our attention to a few interesting types of referents that complicate
the reference resolution problem. For instance, in some cases a referring
expression does not refer to an entity that has been explicitly evoked in
the text, but instead to one that is inferentially related to an evoked
entity. Such referents are called inferrables (Haviland and Clark, 1974;
Prince, 1981). Consider the expressions a door and the engine in sentence
(18.21).

(18.21) I almost bought an Acura Integra today, but a door had a dent and
the engine seemed noisy.

The  indefinite  noun  phrase  a door  would  normally  introduce  a new  door  into 
the  discourse  context,  but  in  this  case  the  hearer  is  to  infer  something  more: 
that  it  is  not  just  any  door,  but  one  of  the  doors  of  the  Integra.  Similarly,  the 
use  of  the  definite  noun  phrase  the  engine  normally  presumes  that  an  engine 
has  been  previously  evoked  or  is  otherwise  uniquely  identifiable.  Here,  no 
engine  has  been  explicitly  mentioned,  but  the  hearer  infers  that  the  referent 
is  the  engine  of  the  previously  mentioned  Integra. 

Inferrables can also specify the results of processes described by
utterances in a discourse. Consider the possible follow-ons (a-c) to
sentence (18.22) in the following recipe (from Webber and Baldwin (1992)):

(18.22)  Mix  the  flour,  butter,  and  water. 

a. Knead the dough until smooth and shiny.

b. Spread the paste over the blueberries.

c. Stir the batter until all lumps are gone.

Any  of  the  expressions  the  dough  (a  solid),  the  batter  (a  liquid),  and  the 
paste  (somewhere  in  between)  can  be  used  to  refer  to  the  result  of  the  actions 
described  in  the  first  sentence,  but  all  imply  different  properties  of  this  result. 

Discontinuous Sets  In some cases, references using plural referring
expressions like they and them (see page 672) refer to sets of entities that
are evoked together, for instance, using another plural expression (their
Acuras) or a conjoined noun phrase (John and Mary):

(18.23)  John  and  Mary  love  their  Acuras.  They  drive  them  all  the  time. 

However,  plural  references  may  also  refer  to  sets  of  entities  that  have 
been  evoked  by  discontinuous  phrases  in  the  text: 

(18.24)  John  has  an  Acura,  and  Mary  has  a Mazda.  They  drive  them  all  the 
time. 

Here, they refers to John and Mary, and likewise them refers to the Acura
and the Mazda. Note also that the second sentence in this case will
generally receive what is called a pairwise or respectively reading, in
which John drives the Acura and Mary drives the Mazda, as opposed to the
reading in which they both drive both cars.

Generics  Making the reference problem even more complicated is the
existence of generic reference. Consider example (18.25).

(18.25) I saw no less than 6 Acura Integras today. They are the coolest cars.

Here,  the  most  natural  reading  is  not  the  one  in  which  they  refers  to  the 
particular  6 Integras  mentioned  in  the  first  sentence,  but  instead  to  the  class 
of  Integras  in  general. 

Syntactic  and  Semantic  Constraints  on  Coreference 

Having described a variety of reference phenomena that are found in natural
language, we can now consider how one might develop algorithms for
identifying the referents of referential expressions. One step that needs to
be taken in any successful reference resolution algorithm is to filter the
set of possible referents on the basis of certain relatively hard-and-fast
constraints. We describe some of these constraints here.

Number Agreement  Referring expressions and their referents must agree in
number; for English, this means distinguishing between singular and plural
references. A categorization of pronouns with respect to number is shown in
Figure 18.2.


Singular                     Plural                Unspecified
she, her, he, him, his, it   we, us, they, them    you

Figure 18.2  Number agreement in the English pronominal system.

The  following  examples  illustrate  constraints  on  number  agreement. 

(18.26)  John  has  a new  Acura.  It  is  red. 

(18.27) John has three new Acuras. They are red.

(18.28) * John has a new Acura. They are red.

(18.29)  * John  has  three  new  Acuras.  It  is  red. 
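
The number classes of Figure 18.2 can be applied as a simple filter on candidate referents. The following Python sketch is only illustrative: the names PRONOUN_NUMBER and agrees_in_number and the labels "sg" and "pl" are assumptions made here, and the sketch presumes that each candidate referent has already been marked for number.

```python
# Number classes transcribed from Figure 18.2. The names used here
# (PRONOUN_NUMBER, agrees_in_number, "sg", "pl") are illustrative assumptions.
PRONOUN_NUMBER = {
    "she": "sg", "her": "sg", "he": "sg", "him": "sg", "his": "sg", "it": "sg",
    "we": "pl", "us": "pl", "they": "pl", "them": "pl",
    "you": None,  # unspecified: compatible with singular or plural referents
}


def agrees_in_number(pronoun, referent_number):
    """Return True if the pronoun may refer to a referent of the given number."""
    number = PRONOUN_NUMBER[pronoun.lower()]
    return number is None or number == referent_number


# Examples (18.26) through (18.29): the two starred readings are rejected.
for pronoun, referent_number in [("It", "sg"), ("They", "pl"),
                                 ("They", "sg"), ("It", "pl")]:
    print(pronoun, referent_number, agrees_in_number(pronoun, referent_number))
```

Treating you as compatible with either number mirrors the Unspecified column of the figure.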

Person  and  Case  Agreement  English  distinguishes  between  three  forms 
of  person:  first,  second,  and  third.  A categorization  of  pronoun  types  with 
respect  to  person  is  shown  in  Figure  18.3. 

The  following  examples  illustrate  constraints  on  person  agreement. 

(18.30)  You  and  I have  Acuras.  We  love  them. 

(18.31)  John  and  Mary  have  Acuras.  They  love  them. 

(18.32) * John and Mary have Acuras. We love them. (where We=John and Mary)

(18.33) * You and I have Acuras. They love them. (where They=You and I)

             First      Second    Third
Nominative   I, we      you       he, she, they
Accusative   me, us     you       him, her, them
Genitive     my, our    your      his, her, their

Figure 18.3  Person and case agreement in the English pronominal system.

In addition, English pronouns are constrained by case agreement; different
forms of the pronoun may be required when placed in subject position
(nominative case, e.g., he, she, they), object position (accusative case,
e.g., him, her, them), and genitive position (genitive case, e.g., his Acura,
her Acura, their Acura). This categorization is also shown in Figure 18.3.

Gender Agreement  Referents also must agree with the gender specified by the
referring expression. English third person pronouns distinguish between
male, female, and nonpersonal genders, and unlike many languages, the first
two only apply to animate entities. Some examples are shown in Figure 18.4.


masculine      feminine    nonpersonal
he, him, his   she, her    it

Figure 18.4  Gender agreement in the English pronominal system.

The  following  examples  illustrate  constraints  on  gender  agreement. 

(18.34)  John  has  an  Acura.  He  is  attractive.  (he=John,  not  the  Acura) 

(18.35)  John  has  an  Acura.  It  is  attractive.  (it=the  Acura,  not  John) 
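
Gender agreement can be used in the same way to prune candidate referents. The sketch below is a minimal illustration in Python; PRONOUN_GENDER and filter_by_gender are hypothetical names, and the gender label attached to each candidate is assumed to be supplied by some earlier processing step.

```python
# Gender classes transcribed from Figure 18.4. The names PRONOUN_GENDER and
# filter_by_gender are illustrative assumptions.
PRONOUN_GENDER = {
    "he": "masculine", "him": "masculine", "his": "masculine",
    "she": "feminine", "her": "feminine",
    "it": "nonpersonal",
}


def filter_by_gender(pronoun, candidates):
    """Keep the candidate referents whose gender matches the pronoun.

    candidates is a list of (name, gender) pairs.
    """
    wanted = PRONOUN_GENDER[pronoun.lower()]
    return [name for name, gender in candidates if gender == wanted]


candidates = [("John", "masculine"), ("the Acura", "nonpersonal")]
print(filter_by_gender("He", candidates))  # the situation in example (18.34)
print(filter_by_gender("It", candidates))  # the situation in example (18.35)
```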

Syntactic Constraints  Reference relations may also be constrained by the
syntactic relationships between a referential expression and a possible
antecedent noun phrase when both occur in the same sentence. For instance,
the pronouns in all of the following sentences are subject to the
constraints indicated in brackets.

(18.36)  John  bought  himself  a new  Acura.  [himself=John] 

(18.37) John bought him a new Acura. [him≠John]

(18.38) John said that Bill bought him a new Acura. [him≠Bill]

(18.39) John said that Bill bought himself a new Acura. [himself=Bill]

(18.40) He said that he bought John a new Acura. [He≠John; he≠John]

English pronouns such as himself, herself, and themselves are called
reflexives. Oversimplifying the situation considerably, a reflexive corefers
with the subject of the most immediate clause that contains it (ex. 18.36),
whereas a nonreflexive cannot corefer with this subject (ex. 18.37). That
this rule applies only for the subject of the most immediate clause is shown
by examples (18.38) and (18.39), in which the opposite reference pattern is
manifest between the pronoun and the subject of the higher sentence. On the
other hand, a full noun phrase like John cannot corefer with the subject of
the most immediate clause nor with a higher-level subject (ex. 18.40).
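
The deliberately oversimplified rule for reflexives and nonreflexives can be stated as a small function. This Python sketch assumes that a parser has already supplied the subjects of the clauses containing the pronoun, innermost first; the function name permitted_subjects is hypothetical, and agreement constraints and antecedents that are not subjects are ignored.

```python
# The reflexives named in the text; every other pronoun is treated as
# nonreflexive. permitted_subjects is an illustrative name.
REFLEXIVES = {"himself", "herself", "themselves"}


def permitted_subjects(pronoun, clause_subjects):
    """Return the subjects that the simplified rule lets the pronoun corefer with.

    clause_subjects lists the subjects of the clauses that contain the
    pronoun, innermost clause first.
    """
    innermost, *higher = clause_subjects
    if pronoun.lower() in REFLEXIVES:
        return [innermost]  # a reflexive corefers with the closest subject
    return higher           # a nonreflexive may not corefer with that subject


print(permitted_subjects("himself", ["John"]))          # (18.36)
print(permitted_subjects("him", ["John"]))              # (18.37)
print(permitted_subjects("him", ["Bill", "John"]))      # (18.38)
print(permitted_subjects("himself", ["Bill", "John"]))  # (18.39)
```

For a nonreflexive, the returned subjects are merely permitted antecedents; the pronoun may equally refer to an entity from an earlier sentence.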

Although these syntactic constraints are stated in terms of a referring
expression and a particular potential antecedent noun phrase, they actually
prohibit coreference between the two regardless of any other available
antecedents that denote the same entity. For instance, normally a
nonreflexive pronoun like him can corefer with the subject of the previous
sentence as it does in example (18.41), but it cannot in example (18.42)
because of the existence of the coreferential pronoun he in the second
clause.

(18.41)  John  wanted  a new  car.  Bill  bought  him  a new  Acura.  [him=John] 

(18.42) John wanted a new car. He bought him a new Acura.
[he=John; him≠John]

The rules given above oversimplify the situation in a number of ways, and
there are many cases that they do not cover. Indeed, upon further inspection
the facts actually get quite complicated. In fact, it is unlikely that all
of the data can be explained using only syntactic relations (Kuno, 1987).
For instance, the reflexive himself and the nonreflexive him in sentences
(18.43) and (18.44) respectively can both refer to the subject John, even
though they occur in identical syntactic configurations.

(18.43) John set the pamphlets about Acuras next to himself. [himself=John]

(18.44)  John  set  the  pamphlets  about  Acuras  next  to  him.  [him=John] 

For  the  algorithms  discussed  later  in  this  chapter,  however,  we  will  assume  a 
syntactic  account  of  restrictions  on  intrasentential  coreference. 

Selectional  Restrictions  The  selectional  restrictions  that  a verb  places  on 
its  arguments  (see  Chapter  16)  may  be  responsible  for  eliminating  referents, 
as  in  example  (18.45). 

(18.45)  John  parked  his  Acura  in  the  garage.  He  had  driven  it  around  for 
hours. 

There are two possible referents for it, the Acura and the garage. The verb
drive, however, requires that its direct object denote something that can be
driven, such as a car, truck, or bus, but not a garage. Thus, the fact that
the pronoun appears as the object of drive restricts the set of possible
referents to the Acura. It is conceivable that a practical NLP system would
include a reasonably comprehensive set of selectional constraints for the
verbs in its lexicon.
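
A selectional filter of this kind might be sketched as follows. The tiny lexicon and the semantic type labels ("car", "building") are invented for illustration, as are the names DIRECT_OBJECT_TYPES and filter_by_selection; a practical system would need a far richer type hierarchy.

```python
# Toy lexicon: each verb maps to the semantic types its direct object may
# have. The entries and type labels are hypothetical.
DIRECT_OBJECT_TYPES = {
    "drive": {"car", "truck", "bus"},
}


def filter_by_selection(verb, candidates):
    """Keep candidates whose semantic type satisfies the verb's restriction.

    candidates maps a referent name to its semantic type. Verbs missing from
    the lexicon impose no restriction.
    """
    allowed = DIRECT_OBJECT_TYPES.get(verb)
    if allowed is None:
        return list(candidates)
    return [name for name, sem_type in candidates.items() if sem_type in allowed]


candidates = {"the Acura": "car", "the garage": "building"}
print(filter_by_selection("drive", candidates))  # the situation in (18.45)
```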

Selectional  restrictions  can  be  violated  in  the  case  of  metaphor  (see 
Chapter  16);  for  example,  consider  example  (18.46). 

(18.46)  John  bought  a new  Acura.  It  drinks  gasoline  like  you  would  not 
believe. 

While the verb drink does not usually take an inanimate subject, its
metaphorical use here allows it to refer to a new Acura.

Of  course,  there  are  more  general  semantic  constraints  that  may  come 
into  play,  but  these  are  much  more  difficult  to  encode  in  a comprehensive 
manner.  Consider  passage  (18.47). 

(18.47)  John  parked  his  Acura  in  the  garage.  It  is  incredibly  messy,  with 
old  bike  and  car  parts  lying  around  everywhere. 

Here the referent of it is almost certainly the garage, but only because a
car is probably too small to have bike and car parts lying around
‘everywhere’.
Resolving  this  reference  requires  that  a system  have  knowledge  about  how 
large  cars  typically  are,  how  large  garages  typically  are,  and  the  typical  types 
of  objects  one  might  find  in  each.  On  the  other  hand,  one’s  knowledge  about 
Beverly  Hills  might  lead  one  to  assume  that  the  Acura  is  indeed  the  referent 
of  it  in  passage  (18.48). 

(18.48)  John  parked  his  Acura  in  downtown  Beverly  Hills.  It  is  incredibly 
messy,  with  old  bike  and  car  parts  lying  around  everywhere. 

In  the  end,  just  about  any  knowledge  shared  by  the  discourse  participants 
might  be  necessary  to  resolve  a pronoun  reference.  However,  due  in  part  to 
the  vastness  of  such  knowledge,  practical  algorithms  typically  do  not  rely  on 
it  heavily. 

Preferences  in  Pronoun  Interpretation 

In the previous section, we discussed relatively strict constraints that
algorithms should apply when determining possible referents for referring
expressions. We now discuss some more readily violated preferences that
algorithms can be made to account for. These preferences have been posited
to apply to pronoun interpretation in particular. Since the majority of work
on reference resolution algorithms has focused on pronoun interpretation, we
will similarly focus on this problem in the remainder of this section.

Recency  Most theories of reference incorporate the notion that entities
introduced in recent utterances are more salient than those introduced in
utterances further back. Thus, in example (18.49), the pronoun it is more
likely to refer to the Legend than the Integra.

(18.49)  John  has  an  Integra.  Bill  has  a Legend.  Mary  likes  to  drive  it. 

Grammatical Role  Many theories specify a salience hierarchy of entities
that is ordered by the grammatical position of the expressions which denote
them. These invariably treat entities mentioned in subject position as more
salient than those in object position, which are in turn more salient than
those mentioned in subsequent positions.

Passages such as (18.50) and (18.51) lend support for such a hierarchy.
Although the first sentence in each case expresses roughly the same
propositional content, the preferred referent for the pronoun he varies with
the subject in each case: John in (18.50) and Bill in (18.51). In example
(18.52), the references to John and Bill are conjoined within the subject
position. Since both seemingly have the same degree of salience, it is
unclear to which the pronoun refers.

(18.50) John went to the Acura dealership with Bill. He bought an Integra.
[ he = John ]

(18.51) Bill went to the Acura dealership with John. He bought an Integra.
[ he = Bill ]

(18.52) John and Bill went to the Acura dealership. He bought an Integra.
[ he = ?? ]

Repeated Mention  Some theories incorporate the idea that entities that have
been focused on in the prior discourse are more likely to continue to be
focused on in subsequent discourse, and hence references to them are more
likely to be pronominalized. For instance, whereas the pronoun in example
(18.51) has Bill as its preferred interpretation, the pronoun in the final
sentence of example (18.53) is more likely to refer to John.

(18.53)  John  needed  a car  to  get  to  his  new  job.  He  decided  that  he  wanted 
something  sporty.  Bill  went  to  the  Acura  dealership  with  him.  He 
bought  an  Integra.  [ he  = John  ] 

Parallelism  There are also strong preferences that appear to be induced by
parallelism effects, as in example (18.54).

(18.54) Mary went with Sue to the Acura dealership. Sally went with her to
the Mazda dealership. [ her = Sue ]

The grammatical role hierarchy described above ranks Mary as more salient
than Sue, and thus Mary should be the preferred referent of her.
Furthermore, there is no semantic reason that Mary cannot be the referent.
Nonetheless, her is instead understood to refer to Sue.

This suggests that we might want a heuristic which says that non-subject
pronouns prefer non-subject referents. However, such a heuristic may not
work for cases that lack the structural parallelism of example (18.54), such
as example (18.55), in which Mary is the preferred referent of the pronoun
instead of Sue.

(18.55)  Mary  went  with  Sue  to  the  Acura  dealership.  Sally  told  her  not  to 
buy  anything.  [ her  = Mary  ] 

Verb Semantics  Certain verbs appear to place a semantically-oriented
emphasis on one of their argument positions, which can have the effect of
biasing the manner in which subsequent pronouns are interpreted. Compare
sentences (18.56) and (18.57).

(18.56)  John  telephoned  Bill.  He  lost  the  pamphlet  on  Acuras. 

(18.57)  John  criticized  Bill.  He  lost  the  pamphlet  on  Acuras. 

These examples differ only in the verb used in the first sentence, yet the
subject pronoun in passage (18.56) is typically resolved to John, whereas
the pronoun in passage (18.57) is resolved to Bill. Some researchers have
claimed that this effect results from what has been called the ‘implicit
causality’ of a verb: the implicit cause of a ‘criticizing’ event is
considered to be its object, whereas the implicit cause of a ‘telephoning’
event is considered to be its subject. This emphasis results in a higher
degree of salience for the entity in this argument position, which leads to
the different preferences for examples (18.56) and (18.57).

Similar preferences have been articulated in terms of the thematic roles
(see Chapter 16) that the potential antecedents occupy. For example, most
hearers resolve He to John in example (18.58) and to Bill in example
(18.59). Although these referents are evoked from different grammatical role
positions, they both fill the Goal thematic role of their corresponding
verbs, whereas the other potential referent fills the Source. Likewise,
hearers generally resolve He to John and Bill in examples (18.60) and
(18.61) respectively, providing evidence that fillers of the Stimulus role
are preferred over fillers of the Experiencer role.

(18.58) John seized the Acura pamphlet from Bill. He loves reading about
cars. (Goal=John, Source=Bill)

(18.59) John passed the Acura pamphlet to Bill. He loves reading about
cars. (Goal=Bill, Source=John)

(18.60) The car dealer admired John. He knows Acuras inside and out.
(Stimulus=John, Experiencer=the car dealer)

(18.61) The car dealer impressed John. He knows Acuras inside and out.
(Stimulus=the car dealer, Experiencer=John)

An Algorithm for Pronoun Resolution

None of the algorithms for pronoun resolution that have been proposed to
date successfully accounts for all of these preferences, let alone succeeds
in resolving the contradictions that will arise between them. However,
Lappin and Leass (1994) describe a straightforward algorithm for pronoun
interpretation that takes many of these into consideration. The algorithm
employs a simple weighting scheme that integrates the effects of the recency
and syntactically-based preferences; no semantic preferences are employed
beyond those enforced by agreement. We describe a slightly simplified
portion of the algorithm that applies to non-reflexive, third person
pronouns.

Broadly  speaking,  there  are  two  types  of  operations  performed  by  the 
algorithm:  discourse  model  update  and  pronoun  resolution.  First,  when  a 
noun  phrase  that  evokes  a new  entity  is  encountered,  a representation  for  it 
must  be  added  to  the  discourse  model  and  a degree  of  salience  (which  we 
call  a salience  value)  computed  for  it.  The  salience  value  is  calculated  as 
the  sum  of  the  weights  assigned  by  a set  of  salience  factors.  The  salience 
factors  used  and  their  corresponding  weights  are  shown  in  Figure  18.5. 


Sentence recency                                   100
Subject emphasis                                    80
Existential emphasis                                70
Accusative (direct object) emphasis                 50
Indirect object and oblique complement emphasis     40
Non-adverbial emphasis                              50
Head noun emphasis                                  80

Figure 18.5  Salience factors in Lappin and Leass’s system.

The weights that each factor assigns to an entity in the discourse model are
cut in half each time a new sentence is processed. This, along with the
added effect of the sentence recency weight (which initially assigns a
weight of 100, to be cut in half with each succeeding sentence), captures
the Recency preference described on page 676, since referents mentioned in
the current sentence will tend to have higher weights than those in the
previous sentence, which will in turn be higher than those in the sentence
before that, and so forth.
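
The weighting and halving scheme can be made concrete with a short Python sketch. The weights are those of Figure 18.5; the short factor names and the functions mention_salience and decayed are assumptions introduced here for illustration.

```python
# Weights from Figure 18.5; the short factor names are illustrative.
SALIENCE_WEIGHTS = {
    "recency": 100, "subject": 80, "existential": 70, "object": 50,
    "indirect_object": 40, "non_adverbial": 50, "head_noun": 80,
}


def mention_salience(factors):
    """Sum the weights of the factors that apply to one noun phrase."""
    return sum(SALIENCE_WEIGHTS[factor] for factor in factors)


def decayed(value, sentences_elapsed):
    """Halve a salience value once for each new sentence processed."""
    return value / 2 ** sentences_elapsed


# A non-embedded subject noun phrase outside any demarcated adverbial PP.
subject_np = mention_salience(["recency", "subject", "non_adverbial", "head_noun"])
for age in range(4):
    print(age, decayed(subject_np, age))
```

Because every weight is halved at each sentence boundary, a mention from two sentences back retains only a quarter of its original contribution.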

Similarly, the next five factors in Figure 18.5 can be viewed as a way of
encoding a grammatical role preference scheme using the following hierarchy:

subject > existential predicate nominal > object > indirect object or
oblique > demarcated adverbial PP

These five positions are exemplified by the position of the noun phrase
denoting the Acura Integra in examples (18.62) through (18.66) respectively.

(18.62)  An  Acura  Integra  is  parked  in  the  lot.  (subject) 

(18.63)  There  is  an  Acura  Integra  parked  in  the  lot.  (existential  predicate 
nominal) 

(18.64)  John  parked  an  Acura  Integra  in  the  lot.  (object) 

(18.65) John gave his Acura Integra a bath. (indirect object)

(18.66) Inside his Acura Integra, John showed Susan his new CD player.
(demarcated adverbial PP)

The preference against referents in demarcated adverbial PPs (i.e., those
separated by punctuation, as with the comma in example (18.66)) is encoded
as a positive weight of 50 for every other position, listed as the
non-adverbial emphasis weight in Figure 18.5. This ensures that the weight
for any referent is always positive, which is necessary so that the effect
of halving the weights is always to reduce them.

The head noun emphasis factor penalizes referents which are embedded in
larger noun phrases, again by promoting the weights of referents that are
not. Thus, the Acura Integra in each of examples (18.62) through (18.66)
will receive 80 points for being denoted by a head noun, whereas the Acura
Integra in example (18.67) will not, since it is embedded within the subject
noun phrase.

(18.67)  The  owner’s  manual  for  an  Acura  Integra  is  on  John’s  desk. 

Each of these factors contributes to the salience of a referent based on the
properties of the noun phrase that denotes it. Of course, it could be that
several noun phrases in the preceding discourse refer to the same referent,
each being assigned a different level of salience, and thus we need a way in
which to combine the contributions of each. To address this, Lappin and
Leass associate with each referent an equivalence class that contains all of
the noun phrases that have been determined to refer to it. The weight that a
salience factor assigns to a referent is the highest of the weights it
assigns to the members of its equivalence class. The salience weight for a
referent is then calculated by summing these weights for each factor. The
scope of a salience factor is a sentence, so, for instance, if a potential
referent is mentioned in the current sentence as well as the previous one,
the sentence recency weight will be factored in for each. (On the other
hand, if the same referent is mentioned more than once in the same sentence,
this weight will be counted only once.) Thus, multiple mentions of a
referent in the prior discourse can potentially increase its salience, which
has the effect of encoding the preference for repeated mentions discussed on
page 676.
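
One way to read this combination rule is sketched below in Python. It is an interpretation rather than Lappin and Leass's own formulation: because each factor assigns a fixed weight, taking the highest weight per factor within a sentence amounts to counting each factor at most once per sentence. The names WEIGHTS and referent_salience are hypothetical.

```python
# Weights from Figure 18.5; the short factor names are illustrative.
WEIGHTS = {
    "recency": 100, "subject": 80, "existential": 70, "object": 50,
    "indirect_object": 40, "non_adverbial": 50, "head_noun": 80,
}


def referent_salience(mentions, current_sentence):
    """Combine the mentions in a referent's equivalence class.

    mentions is a list of (sentence_index, factors) pairs, one per noun
    phrase that has been determined to refer to the referent.
    """
    factors_by_sentence = {}
    for sentence_index, factors in mentions:
        # Within one sentence a factor counts once, however many mentions earn it.
        factors_by_sentence.setdefault(sentence_index, set()).update(factors)
    total = 0.0
    for sentence_index, factors in factors_by_sentence.items():
        age = current_sentence - sentence_index  # number of halvings applied
        total += sum(WEIGHTS[f] for f in factors) / 2 ** age
    return total


subject = {"recency", "subject", "non_adverbial", "head_noun"}
# Mentioned as subject in sentences 0 and 1, evaluated at sentence 1.
print(referent_salience([(0, subject), (1, subject)], current_sentence=1))
# Two subject mentions within the same sentence count only once.
print(referent_salience([(0, subject), (0, subject)], current_sentence=0))
```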

Once we have updated the discourse model with new potential referents and
recalculated the salience values associated with them, we are ready to
consider the process of resolving any pronouns that exist within a new
sentence. In doing this, we factor in two more salience weights, one for
grammatical role parallelism between the pronoun and the potential referent,
and one to disprefer cataphoric reference. The weights are shown in Figure
18.6. Unlike the other preferences, these two cannot be calculated
independently of the pronoun, and thus cannot be calculated during the
discourse model update step. We will use the term initial salience value for
the weight of a given referent before these factors are applied, and the
term final salience value for the weight after they have applied.


Role Parallelism      35
Cataphora           -175

Figure 18.6  Per pronoun salience weights in Lappin and Leass’s system.

We are now ready to specify the pronoun resolution algorithm. Assuming that
the discourse model has been updated to reflect the initial salience values
of referents as described above, the steps taken to resolve a pronoun are as
follows:

1. Collect the potential referents (up to four sentences back).

2. Remove potential referents that do not agree in number or gender with
the pronoun.

3. Remove potential referents that do not pass intrasentential syntactic
coreference constraints (as described on page 673).

4. Compute the total salience value of the referent by adding any applicable
values from Figure 18.6 to the existing salience value previously computed
during the discourse model update step (i.e., the sum of the applicable
values in Figure 18.5).

5. Select the referent with the highest salience value. In the case of ties,
select the closest referent in terms of string position (computed without
bias to direction).
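
The five steps can be sketched in Python as follows. This is a simplified illustration, not Lappin and Leass's implementation: the Pronoun and Candidate records, their field names, and the token positions in the usage example are assumptions, and the syntactic filter of Step 3 is represented only by a precomputed flag.

```python
from dataclasses import dataclass

ROLE_PARALLELISM = 35    # Figure 18.6
CATAPHORA = -175         # Figure 18.6
MAX_SENTENCES_BACK = 4   # Step 1


@dataclass
class Pronoun:
    number: str
    gender: str
    role: str
    position: int            # string position of the pronoun


@dataclass
class Candidate:
    name: str
    number: str
    gender: str
    role: str                # grammatical role of a mention, for parallelism
    position: int            # string position of the nearest mention
    sentences_back: int
    salience: float          # initial salience from the discourse model
    syntactically_ok: bool = True  # outcome of the Step 3 constraints


def final_salience(pronoun, cand):
    """Step 4: add the per-pronoun weights to the initial salience value."""
    value = cand.salience
    if cand.role == pronoun.role:
        value += ROLE_PARALLELISM
    if cand.position > pronoun.position:  # the referent follows the pronoun
        value += CATAPHORA
    return value


def resolve(pronoun, candidates):
    pool = [c for c in candidates if c.sentences_back <= MAX_SENTENCES_BACK]  # Step 1
    pool = [c for c in pool
            if c.number == pronoun.number and c.gender == pronoun.gender]     # Step 2
    pool = [c for c in pool if c.syntactically_ok]                            # Step 3
    if not pool:
        return None
    # Step 5: highest final salience; ties go to the closest string position.
    return max(pool, key=lambda c: (final_salience(pronoun, c),
                                    -abs(c.position - pronoun.position)))


# The pronoun it in "He showed it to Bob", after the first sentence of (18.68).
it = Pronoun("sg", "nonpersonal", "object", position=11)
candidates = [
    Candidate("John", "sg", "masculine", "subject", 9, 0, 465.0),
    Candidate("Integra", "sg", "nonpersonal", "object", 5, 1, 140.0),
    Candidate("dealership", "sg", "nonpersonal", "oblique", 8, 1, 115.0),
]
print(resolve(it, candidates).name)
```

Encoding the tie-break as the second element of the sort key keeps Steps 4 and 5 in a single max call.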

We illustrate the operation of the algorithm by stepping through example
(18.68).

(18.68) John saw a beautiful Acura Integra at the dealership. He showed it
to Bob. He bought it.

We first process the first sentence to collect potential referents and
compute their initial salience values. The following table shows the
contribution to salience from each of the salience factors.


            Rec  Subj  Exist  Obj  Ind-Obj  Non-Adv  Head N  Total
John        100    80      -    -        -       50      80    310
Integra     100     -      -   50        -       50      80    280
dealership  100     -      -    -        -       50      80    230

There are no pronouns to be resolved in this sentence, so we move on to the
next, degrading the above values by a factor of two as shown in the
following table. The phrases column shows the equivalence class of referring
expressions for each referent.


Referent     Phrases                             Value
John         { John }                              155
Integra      { a beautiful Acura Integra }         140
dealership   { the dealership }                    115

The first noun phrase in the second sentence is the pronoun he. Because he
specifies male gender, Step 2 of the resolution algorithm reduces the set of
possible referents to include only John, so we can stop there and take this
to be the referent.

The discourse model must now be updated. First, the pronoun he is added in
the equivalence class for John. Since he occurs in the current sentence and
John in the previous one, the salience factors do not overlap between the
two. The pronoun is in the current sentence (recency=100), subject position
(=80), not in an adverbial (=50), and not embedded (=80), and so a total of
310 is added to the current weight for John:


Referent     Phrases                             Value
John         { John, he_1 }                        465
Integra      { a beautiful Acura Integra }         140
dealership   { the dealership }                    115

The next noun phrase in the second sentence is the pronoun it, which is
compatible with the Integra or the dealership. We first need to compute the
final salience values by adding the applicable weights from Figure 18.6 to
the initial salience values above. Neither referent assignment would result
in cataphora, so that factor does not apply. For the parallelism preference,
both it and a beautiful Acura Integra are in object position within their
respective sentences (whereas the dealership is not), so a weight of 35 is
added to this option. With the Integra having a weight of 175 and the
dealership a weight of 115, the Integra is taken to be the referent.

Again, the discourse model must now be updated. Since it is in a nonembedded
object position, it receives a weight of 100+50+50+80=280, which is added to
the current weight for the Integra, giving 140+280=420.


Referent     Phrases                             Value
John         { John, he_1 }                        465
Integra      { a beautiful Acura Integra, it_1 }   420
dealership   { the dealership }                    115

The  final  noun  phrase  in  the  second  sentence  is  Bob,  which  introduces 
a new  discourse  referent.  Since  it  occupies  an  oblique  argument  position,  it 
receives  a weight  of  100+40+50+80=270. 


Referent     Phrases                             Value
John         { John, he_1 }                        465
Integra      { a beautiful Acura Integra, it_1 }   420
Bob          { Bob }                               270
dealership   { the dealership }                    115

Now we are ready to move on to the final sentence. We again degrade the
current weights by one half.

Referent     Phrases                             Value
John         { John, he_1 }                      232.5
Integra      { a beautiful Acura Integra, it_1 }   210
Bob          { Bob }                               135
dealership   { the dealership }                   57.5

The  reader  can  confirm  that  the  referent  of  he  will  be  resolved  to  John, 
and  the  referent  of  it  to  the  Integra. 
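
The bookkeeping in this walk-through can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the factor names are our own labels for the entries of Figure 18.6, only the factors whose values appear in the sums above are included, and the Integra's starting value of 140 is inferred from the tables (420 - 280) rather than taken from a full implementation.

```python
# Salience bookkeeping for the worked example (a sketch, not the full
# Lappin and Leass system). Factor names are illustrative labels.
WEIGHTS = {
    "recency": 100,
    "object": 50,
    "oblique": 40,
    "non_adverbial": 50,
    "head_noun": 80,
}
PARALLELISM = 35  # per-pronoun bonus; used for comparison, not stored


def mention_weight(*factors):
    """Sum the weights of the salience factors that apply to one mention."""
    return sum(WEIGHTS[f] for f in factors)


def add_mention(model, referent, *factors):
    """Add a mention's weight to a referent, creating the referent if new."""
    model[referent] = model.get(referent, 0) + mention_weight(*factors)


def degrade(model):
    """Halve every salience value when a new sentence is processed."""
    return {referent: value / 2 for referent, value in model.items()}


# State after 'he' has been resolved to John in the second sentence.
model = {"John": 465, "Integra": 140, "dealership": 115}

# Resolve 'it': only the Integra earns the parallelism bonus.
scores = {
    "Integra": model["Integra"] + PARALLELISM,
    "dealership": model["dealership"],
}
referent = max(scores, key=scores.get)

# Update the discourse model with the mentions 'it' and 'Bob'.
add_mention(model, referent, "recency", "object", "non_adverbial", "head_noun")
add_mention(model, "Bob", "recency", "oblique", "non_adverbial", "head_noun")
after_sentence_2 = dict(model)

# Moving to the third sentence halves every value.
before_sentence_3 = degrade(model)
print(after_sentence_2)
print(before_sentence_3)
```

The two printed dictionaries correspond to the last two tables above.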

The weights used by Lappin and Leass were arrived at by experimentation
on a development corpus of computer training manuals. This algorithm,
when combined with several filters not described here, achieved 86% accuracy
when applied to unseen test data within the same genre. It is possible
that these exact weights may not be optimal for other genres (and even more
so for other languages), so the reader may want to experiment with these on
training data for a new application or language.

In  Exercise  18.7,  we  consider  a version  of  the  algorithm  that  relies 
only  on  a noun  phrase  identifier  (see  also  Kennedy  and  Boguraev  (1996)).  In 
the  next  paragraphs,  we  briefly  summarize  two  other  approaches  to  pronoun 
resolution. 

A Tree Search Algorithm  Hobbs (1978b) describes an algorithm for pronoun
resolution which takes the syntactic representations of the sentences up
to and including the current sentence as input, and performs a search for an
antecedent noun phrase on these trees. There is no explicit representation of
a discourse model or preferences as in the Lappin and Leass algorithm. However,
certain of these preferences are approximated by the order in which the
search on syntactic trees is performed.

An algorithm that searches parse trees must also specify a grammar,
since the assumptions regarding the structure of syntactic trees will affect
the results. A fragment for English that the algorithm uses is given in
Figure 18.7. The steps of the algorithm are as follows.

1.  Begin at the noun phrase (NP) node immediately dominating the pronoun.

2.  Go up the tree to the first NP or sentence (S) node encountered. Call
this node X, and call the path used to reach it p.

3.  Traverse all branches below node X to the left of path p in a left-to-
right, breadth-first fashion. Propose as the antecedent any NP node
that is encountered which has an NP or S node between it and X.

4.  If node X is the highest S node in the sentence, traverse the surface
parse trees of previous sentences in the text in order of recency, the
most recent first; each tree is traversed in a left-to-right, breadth-first
manner, and when an NP node is encountered, it is proposed as antecedent.
If X is not the highest S node in the sentence, continue to step 5.

5.  From node X, go up the tree to the first NP or S node encountered. Call
this new node X, and call the path traversed to reach it p.

6.  If X is an NP node and if the path p to X did not pass through the
Nominal node that X immediately dominates, propose X as the antecedent.

7.  Traverse all branches below node X to the left of path p in a left-to-
right, breadth-first manner. Propose any NP node encountered as the
antecedent.

8.  If X is an S node, traverse all branches of node X to the right of path
p in a left-to-right, breadth-first manner, but do not go below any NP
or S node encountered. Propose any NP node encountered as the antecedent.

9.  Go to Step 4.


S        → NP VP
NP       → (Det) Nominal | pronoun
Det      → determiner | NP 's
PP       → preposition NP
Nominal  → noun (PP)
Rel      → wh-word S
VP       → verb NP (PP)*

Figure 18.7  A grammar fragment for the Tree Search algorithm.

Demonstrating  that  this  algorithm  yields  the  correct  coreference  assignments 
for  example  (18.68)  is  left  as  Exercise  18.3. 
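
The traversal used in step 4 for previous sentences can be sketched as follows. This is a partial illustration only, not Hobbs's full algorithm: it covers just the left-to-right, breadth-first proposal of NP nodes in a single tree, without the path-based steps or any agreement filtering. The tree encoding, the function names, and the simplified parse (Det and Nominal levels collapsed, PP attached to the VP) are assumptions made for the example.

```python
from collections import deque

# A phrase node is a (label, children) pair; a leaf is a plain string.


def np_candidates(tree):
    """Yield NP nodes in left-to-right, breadth-first order."""
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):      # a word, not a phrase node
            continue
        label, children = node
        if label == "NP":
            yield node
        queue.extend(children)         # children are stored left to right


def words(node):
    """Flatten a subtree back into its word string."""
    if isinstance(node, str):
        return node
    return " ".join(words(child) for child in node[1])


# Hypothetical, simplified parse of
# "John saw a beautiful Acura Integra at the dealership".
previous_sentence = (
    "S", [
        ("NP", ["John"]),
        ("VP", [
            "saw",
            ("NP", ["a beautiful Acura Integra"]),
            ("PP", ["at", ("NP", ["the dealership"])]),
        ]),
    ],
)

proposals = [words(np) for np in np_candidates(previous_sentence)]
print(proposals)
```

Because the search is breadth-first, the subject NP is proposed before NPs embedded deeper in the verb phrase, which is one way the search order approximates a grammatical role preference.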

As stated, the algorithm depends on complete and correct syntactic
structures as input. Hobbs evaluated his approach manually (with respect
to both parse construction and algorithm implementation) on one hundred
examples from each of three different texts, reporting an accuracy of 88.3%.
(The accuracy increases to 91.7% if certain selectional restriction constraints
are assumed.) Lappin and Leass encoded a version of this algorithm within
their system, and reported an accuracy of 82% on their test corpus. Although
this is less than the 86% accuracy achieved by their own algorithm, it should
be borne in mind that the test data Lappin and Leass used was from the same
genre as their development set, but different from the genres that Hobbs used
in developing his algorithm.

A Centering Algorithm  As we described above, the Hobbs algorithm
does not use an explicit representation of a discourse model. The Lappin
and Leass algorithm does, but encodes salience as a weighted combination
of preferences. Centering theory (Grosz et al., 1995, henceforth GJW) also
has an explicit representation of a discourse model, and incorporates an
additional claim: that there is a single entity being ‘centered’ on at any given
point in the discourse which is to be distinguished from all other entities that
have been evoked.

There are two main representations tracked in the discourse model. In
what follows, take Un and Un+1 to be two adjacent utterances. The backward
looking center of Un, denoted as Cb(Un), represents the entity currently
being focused on in the discourse after Un is interpreted. The forward looking
centers of Un, denoted as Cf(Un), form an ordered list containing the entities
mentioned in Un, all of which could serve as the Cb of the following
utterance. In fact, Cb(Un+1) is by definition the most highly ranked element of
Cf(Un) mentioned in Un+1. (The Cb of the first utterance in a discourse is
undefined.) As for how the entities in the Cf(Un) are ordered, for simplicity’s
sake we can use the grammatical role hierarchy encoded by (a subset
of) the weights in the Lappin and Leass algorithm, repeated below.¹

subject > existential predicate nominal > object > indirect object
or oblique > demarcated adverbial PP

Unlike the Lappin and Leass algorithm, however, there are no numerical
weights attached to the entities on the list; they are simply ordered relative to
each other. As a shorthand, we will call the highest-ranked forward-looking
center Cp (for ‘preferred center’).
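
As a small illustration, a Cf list can be built by sorting the entities of an utterance by the rank of their grammatical role. The sketch below uses the second sentence of the running example, He showed it to Bob, with the pronouns resolved to John and the Integra as in the preceding walk-through; the function name and the (entity, role) encoding are assumptions made for the example.

```python
# Grammatical role hierarchy from the text, highest rank first.
ROLE_ORDER = [
    "subject",
    "existential predicate nominal",
    "object",
    "indirect object or oblique",
    "demarcated adverbial PP",
]
RANK = {role: i for i, role in enumerate(ROLE_ORDER)}


def forward_looking_centers(mentions):
    """Order (entity, role) pairs into a Cf list by grammatical role."""
    ordered = sorted(mentions, key=lambda mention: RANK[mention[1]])
    return [entity for entity, _role in ordered]


# "He showed it to Bob", with he = John and it = the Integra.
u2 = [
    ("Bob", "indirect object or oblique"),
    ("Integra", "object"),
    ("John", "subject"),
]
cf = forward_looking_centers(u2)
cp = cf[0]   # the preferred center is the highest-ranked element
print(cf, cp)
```

Only the relative order matters here; no numerical salience value is attached to any entity.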

We describe a centering-based algorithm for pronoun interpretation
due to Brennan et al. (1987, henceforth BFP). (See also Walker et al. (1994);
for alternatives, see Kameyama (1986) and Strube and Hahn (1996), inter
alia.) In this algorithm, preferred referents of pronouns are computed from
relations that hold between the forward and backward looking centers in
adjacent sentences. Four intersentential relationships between a pair of
utterances Un and Un+1 are defined depending on the relationship between
Cb(Un+1), Cb(Un), and Cp(Un+1); these are shown in Figure 18.8.

¹ This is an extended form of the hierarchy used in Brennan et al. (1987), described below.


                        Cb(Un+1) = Cb(Un)       Cb(Un+1) ≠ Cb(Un)
                        or undefined Cb(Un)
Cb(Un+1) = Cp(Un+1)     Continue                Smooth-Shift
Cb(Un+1) ≠ Cp(Un+1)     Retain                  Rough-Shift

Figure 18.8  Transitions in the BFP algorithm.

The following rules are used by the algorithm.

• Rule 1: If any element of Cf(Un) is realized by a pronoun in utterance
Un+1, then Cb(Un+1) must be realized as a pronoun also.

• Rule 2: Transition states are ordered. Continue is preferred to Retain,
which is preferred to Smooth-Shift, which is preferred to Rough-Shift.

With these concepts and rules in place, the algorithm is defined as
follows.

1.  Generate possible Cb-Cf combinations for each possible set of reference
assignments.

2.  Filter by constraints, e.g., syntactic coreference constraints, selectional
restrictions, centering rules and constraints.

3.  Rank by transition orderings.

The pronominal referents that get assigned are those which yield the most
preferred relation in Rule 2, assuming that Rule 1 and other coreference
constraints (gender, number, syntactic, selectional restrictions) are not
violated.

Let us step through passage (18.68), repeated below as (18.69), to
illustrate the algorithm.

(18.69)  John saw a beautiful Acura Integra at the dealership. (U1)

He showed it to Bob. (U2)

He bought it. (U3)

Using the grammatical role hierarchy to order the Cf, for sentence U1 we
get:

Cf(U1):  {John, Integra, dealership}

Cp(U1):  John
Cb(U1):  undefined

Sentence U2 contains two pronouns: he, which is compatible with John, and
it, which is compatible with the Acura or the dealership. John is by definition
Cb(U2), because he is the highest ranked member of Cf(U1) mentioned in
U2 (again, John is the only possible referent for he). We compare the resulting
transitions for each possible referent of it. If we assume it refers to the Acura,
the assignments would be:

Cf(U2):  {John, Integra, Bob}

Cp(U2):  John
Cb(U2):  John

Result:  Continue (Cp(U2)=Cb(U2); Cb(U1) undefined)

If  we  assume  it  refers  to  the  dealership,  the  assignments  would  be: 

Cf(U2):  {John, dealership, Bob}

Cp(U2):  John
Cb(U2):  John

Result:  Continue (Cp(U2)=Cb(U2); Cb(U1) undefined)

Since both possibilities result in a Continue transition, the algorithm does
not say which to accept. For the sake of illustration, we will assume that ties
are broken in terms of the ordering on the previous Cf list. Thus, we will
take it to refer to the Integra instead of the dealership, leaving the current
discourse model as represented in the first possibility above.

In  sentence  U3,  he  is  compatible  with  either  John  or  Bob,  whereas  it 
is  compatible  with  the  Integra.  If  we  assume  he  refers  to  John,  then  John  is 
Cb(U3)  and  the  assignments  would  be: 

Cf(U3):  {John, Integra}

Cp(U3):  John
Cb(U3):  John

Result:  Continue (Cp(U3)=Cb(U3)=Cb(U2))

If we assume he refers to Bob, then the Integra is Cb(U3), because it is the
highest-ranked member of Cf(U2) mentioned in U3 (John is not mentioned on
this reading), and the assignments would be:

Cf(U3):  {Bob, Integra}

Cp(U3):  Bob
Cb(U3):  Integra

Result:  Rough-Shift (Cp(U3)≠Cb(U3); Cb(U3)≠Cb(U2))

Since a Continue is preferred to a Rough-Shift per Rule 2, John is correctly
taken to be the referent.
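
A compact sketch of this comparison follows. It assumes that Cf lists are given as Python lists already ordered by grammatical role, and it applies the definition of Cb literally: the highest-ranked element of Cf(Un) that is realized in Un+1. The function and variable names are illustrative, and the filtering step of the algorithm (Rule 1, agreement, syntactic constraints) is omitted.

```python
# Rule 2 preference order, most preferred first.
PREFERENCE = ["Continue", "Retain", "Smooth-Shift", "Rough-Shift"]


def backward_center(cf_prev, cf_new):
    """Cb(Un+1): the highest-ranked element of Cf(Un) realized in Un+1."""
    for entity in cf_prev:
        if entity in cf_new:
            return entity
    return None  # undefined


def transition(cb_new, cp_new, cb_prev):
    """Classify a transition according to Figure 18.8."""
    same_cb = cb_prev is None or cb_new == cb_prev
    if cb_new == cp_new:
        return "Continue" if same_cb else "Smooth-Shift"
    return "Retain" if same_cb else "Rough-Shift"


def rank(cf_prev, cb_prev, candidates):
    """Label each candidate Cf(Un+1) and sort by transition preference."""
    labelled = []
    for cf_new in candidates:
        cb_new = backward_center(cf_prev, cf_new)
        cp_new = cf_new[0]
        labelled.append((transition(cb_new, cp_new, cb_prev), cf_new))
    return sorted(labelled, key=lambda pair: PREFERENCE.index(pair[0]))


# U2 = "He showed it to Bob", U3 = "He bought it".
cf_u2, cb_u2 = ["John", "Integra", "Bob"], "John"
candidates = [["Bob", "Integra"], ["John", "Integra"]]  # he = Bob, he = John
ranked = rank(cf_u2, cb_u2, candidates)
for label, cf_u3 in ranked:
    print(label, cf_u3)
```

On the he = Bob reading, the sketch computes the Integra as Cb(U3), since John is not mentioned and the Integra outranks Bob in Cf(U2); that reading is therefore a Rough-Shift, and the he = John reading, a Continue, is ranked first.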

The main salience factors that the centering algorithm implicitly
incorporates include the grammatical role, recency, and repeated mention
preferences. Unlike the Lappin and Leass algorithm, however, the manner in
which the grammatical role hierarchy affects salience is indirect, since it is
the resulting transition type that determines the final reference assignments.
In particular, a referent in a low-ranked grammatical role will be preferred to
one in a more highly ranked role if the former leads to a more highly ranked
transition. Thus, the centering algorithm may (often, but not always,
incorrectly) resolve a pronoun to a referent that other algorithms would consider
to be of relatively low salience (Lappin and Leass, 1994; Kehler, 1997a). For
instance, in example (18.70),

(18.70)  Bob opened up a new dealership last week. John took a look at the
Acuras in his lot. He ended up buying one.

the centering algorithm will assign Bob as the referent of the subject pronoun
he in the third sentence: since Bob is Cb(U2), this assignment results in a
Continue relation, whereas assigning John results in a Smooth-Shift relation.
On the other hand, the Hobbs and Lappin/Leass algorithms will assign John
as the referent.

Like the Hobbs algorithm, the centering algorithm was developed on
the assumption that correct syntactic structures are available as input. In
order to perform an automatic evaluation on naturally occurring data, the
centering algorithm would have to be specified in greater detail, both in terms
of how all noun phrases in a sentence are ordered with respect to each other
on the Cf list (the current approach only includes nonembedded fillers of
certain grammatical roles, generating only a partial ordering), as well as how
all pronouns in a sentence can be resolved (e.g., recall the indeterminacy in
resolving it in the second sentence of example (18.68)).

Walker (1989), however, performed a manual evaluation of the centering
algorithm on a corpus of 281 examples distributed over texts from three
genres, and compared its performance to the Hobbs algorithm. The evaluation
assumed adequate syntactic representations, grammatical role labeling,
and selectional restriction information as input. Furthermore, in cases in
which the centering algorithm did not uniquely specify a referent, only those
cases in which the Hobbs algorithm identified the correct one were counted
as errors. With this proviso, Walker reports an accuracy of 77.6% for centering
and 81.8% for Hobbs. See also Tetreault (1999) for a comparison
between several centering-based algorithms and the Hobbs algorithm.

18.2  Text Coherence

Much of the previous section focussed on the nature of anaphoric reference
and methods for resolving pronouns in discourse. Anaphoric expressions
have often been called cohesive devices (Halliday and Hasan, 1976), since
the coreference relations they establish serve to ‘tie’ different parts of a
discourse together, thus making it cohesive. While discourses often contain
cohesive devices, the existence of such devices alone does not satisfy a stronger
requirement that a discourse must meet, that of being coherent. In this
section, we describe what it means for a text to be coherent, and computational
mechanisms for determining coherence.

The  Phenomenon 

Assume that you have collected an arbitrary set of well-formed and
independently interpretable utterances, for instance, by randomly selecting one
sentence from each of the previous chapters of this book. Do you have a
discourse? Almost certainly not. The reason is that these utterances, when
juxtaposed, will not exhibit coherence. Consider, for example, the difference
between passages (18.71) and (18.72).

(18.71)  John  hid  Bill’s  car  keys.  He  was  drunk. 

(18.72)  ??  John  hid  Bill’s  car  keys.  He  likes  spinach. 

While most people find passage (18.71) to be rather unremarkable, they
find passage (18.72) to be odd. Why is this so? Like passage (18.71),
the sentences that make up passage (18.72) are well formed and readily
interpretable. Something instead seems to be wrong with the fact that the
sentences are juxtaposed. The hearer might ask, for instance, what hiding
someone’s car keys has to do with liking spinach. By asking this, the hearer
is questioning the coherence of the passage.

Alternatively, the hearer might try to construct an explanation that
makes it coherent, for instance, by conjecturing that perhaps someone
offered John spinach in exchange for hiding Bill’s car keys. In fact, if we
consider a context in which we had known this already, the passage now sounds
a lot better! Why is this? This conjecture allows the hearer to identify John’s
liking spinach as the cause of his hiding Bill’s car keys, which would explain
how the two sentences are connected. The very fact that hearers try to
identify such connections is indicative of the need to establish coherence as part
of discourse comprehension.

The possible connections between utterances in a discourse can be
specified as a set of coherence relations. A few such relations, proposed
by Hobbs (1979a), are given below. The terms S_0 and S_1 represent the
meanings of the two sentences being related.

Result:  Infer that the state or event asserted by S_0 causes or could cause the
state or event asserted by S_1.

(18.73)  John bought an Acura. His father went ballistic.

Explanation:  Infer that the state or event asserted by S_1 causes or could
cause the state or event asserted by S_0.

(18.74)  John hid Bill’s car keys. He was drunk.

Parallel:  Infer p(a_1, a_2, ...) from the assertion of S_0 and p(b_1, b_2, ...)
from the assertion of S_1, where a_i and b_i are similar, for all i.

(18.75)  John bought an Acura. Bill leased a BMW.

Elaboration:  Infer the same proposition P from the assertions of S_0 and S_1.

(18.76)  John bought an Acura this weekend. He purchased a beautiful new
Integra for 20 thousand dollars at Bill’s dealership on Saturday
afternoon.

Occasion:  A change of state can be inferred from the assertion of S_0, whose
final state can be inferred from S_1, or a change of state can be inferred from
the assertion of S_1, whose initial state can be inferred from S_0.

(18.77)  John  bought  an  Acura.  He  drove  to  the  ballgame. 

A mechanism for identifying coherence could support a number of
natural language applications, including information extraction and
summarization. For example, discourses that are coherent by virtue of the Elaboration
relation are often characterized by a summary sentence followed by one or
more sentences adding detail to it, as in passage (18.76). Although there
are two sentences describing events in this passage, the fact that we infer an
Elaboration relation tells us that the same event is being described in each.
A mechanism for identifying this fact could tell an information extraction
or summarization system to merge the information from the sentences and
produce a single event description instead of two.

An Inference Based Resolution Algorithm

Each coherence relation described above is associated with one or more
constraints that must be met for it to hold. How can we apply these constraints?
To do this, we need a method for performing inference. Perhaps the most
familiar type of inference is deduction; recall from Section 14.3 that the
central rule of deduction is modus ponens:

    α ⇒ β
    α
    -----
    β

An  example  of  modus  ponens  is  the  following: 

    All Acuras are fast.
    John’s car is an Acura.
    -----
    John’s car is fast.

Deduction  is  a form  of  sound  inference:  if  the  premises  are  true,  then  the 
conclusion  must  be  true. 

However, much of language understanding is based on inferences that
are not sound. While the ability to draw unsound inferences allows for a
greater range of inferences to be made, it can also lead to false interpretations
and misunderstandings. A method for such inference is logical abduction
(Peirce, 1955). The central rule of abductive inference is:

    α ⇒ β
    β
    -----
    α

Whereas  deduction  runs  an  implication  relation  forward,  abduction  runs  it 
backward,  reasoning  from  an  effect  to  a potential  cause.  An  example  of 
abduction  is  the  following: 

    All Acuras are fast.
    John’s car is fast.
    -----
    John’s car is an Acura.



Obviously, this may be an incorrect inference: John’s car may be made by
another manufacturer yet still be fast.
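
The contrast between the two rules can be made concrete with a toy sketch. The representation below (a single implication stored as a pair of predicate names, and facts as (predicate, argument) pairs) is an assumption chosen for brevity, not a general inference engine.

```python
# One implication, "All Acuras are fast", as (antecedent, consequent).
RULES = [("acura", "fast")]


def deduce(facts):
    """Modus ponens: from a => b and a(x), conclude b(x). Sound."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in RULES:
            for predicate, arg in list(derived):
                new_fact = (consequent, arg)
                if predicate == antecedent and new_fact not in derived:
                    derived.add(new_fact)
                    changed = True
    return derived


def abduce(observation):
    """Abduction: from a => b and b(x), hypothesize a(x). Not sound."""
    predicate, arg = observation
    return [(antecedent, arg)
            for antecedent, consequent in RULES
            if consequent == predicate]


print(deduce({("acura", "johns_car")}))   # includes ("fast", "johns_car")
print(abduce(("fast", "johns_car")))      # a hypothesis, possibly false
```

The result of deduce is guaranteed by the premises, whereas the result of abduce is only a candidate explanation that later information may overturn.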

In general, a given effect β may have many potential causes α_i. We
generally will not want merely to reason from a fact to a possible explanation
of it; we want to identify the best explanation of it. To do this, we need a
method for comparing the quality of alternative abductive proofs. There are a
variety of strategies one could employ for doing this. One possibility is to use
a probabilistic model (Charniak and Goldman, 1988; Charniak and Shimony,
1990), although issues arise in choosing the appropriate space over which to
calculate these probabilities, and in finding a way to acquire them given the
lack of a corpus of events. Another method is to use a purely heuristic
strategy (Charniak and McDermott, 1985, Chapter 10), such
as preferring the explanation with the smallest number of assumptions, or
choosing the explanation that uses the most specific characteristics of the
input. While such heuristics may be easy to implement, they generally prove
to be too brittle and limiting. Finally, a more general cost-based strategy can
be used which combines features (both positive and negative) of the
probabilistic and heuristic approaches. The approach to abductive interpretation
we illustrate here, due to Hobbs et al. (1993), uses such a strategy. To
simplify the discussion, however, we will largely ignore the cost component of
the system, keeping in mind that one is nonetheless necessary.

Hobbs et al. (1993) apply their method to a broad range of problems
in language interpretation; here we focus on its use in establishing discourse
coherence, in which world and domain knowledge are used to determine
the most plausible coherence relation holding between utterances. Let us
step through the analysis that leads to establishing the coherence of
passage (18.71). First, we need axioms about coherence relations themselves.
Axiom (18.78) states that a possible coherence relation is the Explanation
relation; other relations would have analogous axioms.

(18.78)  (∀e_i, e_j) Explanation(e_i, e_j) ⊃ CoherenceRel(e_i, e_j)

The variables e_i and e_j represent the events (or states) denoted by the two
utterances being related, and the ⊃ symbol is used to denote the implication
relation. In this axiom and those given below, quantifiers always scope
over everything to their right. This axiom tells us that, given that we need
to establish a coherence relation between two events, one possibility is to
abductively assume that the relation is Explanation.

The  Explanation  relation  requires  that  the  second  utterance  express  the 
cause  of  the  effect  that  the  first  sentence  expresses.  We  can  state  this  as 
axiom  (18.79). 

(18.79)  (∀e_i, e_j) cause(e_j, e_i) ⊃ Explanation(e_i, e_j)

In  addition  to  axioms  about  coherence  relations,  we  also  need  axioms 
representing  general  knowledge  about  the  world.  The  first  axiom  we  use  says 
that  if  someone  is  drunk,  then  others  will  not  want  that  person  to  drive,  and 
that  the  former  causes  the  latter  (for  convenience,  the  state  of  not  wanting  is 
denoted  by  the  diswant  predicate). 

(18.80)  (∀x, y, e_i) drunk(e_i, x) ⊃
         (∃e_j, e_k) diswant(e_j, y, e_k) ∧ drive(e_k, x) ∧ cause(e_i, e_j)

Before we move on, a few notes are in order concerning this axiom and
the others we will present. First, axiom (18.80) is stated using universal
quantifiers to bind several of the variables, which essentially says that in
all cases in which someone is drunk, all people do not want that person
to drive. Although we might hope that this is generally the case, such a
statement is nonetheless too strong. The way in which this is handled in
the Hobbs et al. system is by including an additional relation, called an etc
predicate, in the antecedent of such axioms. An etc predicate represents all
the other properties that must be true for the axiom to apply, but which are
too vague to state explicitly. These predicates therefore cannot be proven;
they can only be assumed at a corresponding cost. Because rules with high
assumption costs will be dispreferred to ones with low costs, the likelihood
that the rule applies can be encoded in terms of this cost. Since we have
chosen to simplify our discussion by ignoring costs, we will similarly ignore
the use of etc predicates.

Second, each predicate has what may look like an ‘extra’ variable in
the first argument position; for instance, the drive predicate has two
arguments instead of one. This variable is used to reify the relationship denoted
by the predicate so that it can be referred to from argument places in other
predicates. For instance, reifying the drive predicate with the variable e_k
allows us to express the idea of not wanting someone to drive by referring to
it in the final argument of the diswant predicate.

Picking  up  where  we  left  off,  the  second  world  knowledge  axiom  we 
use  says  that  if  someone  does  not  want  someone  else  to  drive,  then  they  do 
not  want  this  person  to  have  his  car  keys,  since  car  keys  enable  someone  to 
drive. 

(18.81)  (∀x, y, e_j, e_k) diswant(e_j, y, e_k) ∧ drive(e_k, x) ⊃
         (∃z, e_l, e_m) diswant(e_l, y, e_m) ∧ have(e_m, x, z) ∧ carkeys(z, x) ∧
         cause(e_j, e_l)

The third axiom says that if someone doesn’t want someone else to have
something, he might hide it from him.

(18.82)  (∀x, y, z, e_l, e_m) diswant(e_l, y, e_m) ∧ have(e_m, x, z) ⊃
         (∃e_n) hide(e_n, y, x, z) ∧ cause(e_l, e_n)

The final axiom says simply that causality is transitive, that is, if e_i causes e_j
and e_j causes e_k, then e_i causes e_k.

(18.83)  (∀e_i, e_j, e_k) cause(e_i, e_j) ∧ cause(e_j, e_k) ⊃ cause(e_i, e_k)

Finally, we have the content of the utterances themselves, that is, that
John hid Bill’s car keys (from Bill),

(18.84)  hide(e_1, john, bill, ck) ∧ carkeys(ck, bill)

and that someone described using the pronoun ‘he’ was drunk; we will
represent the pronoun with the free variable he.

(18.85)  drunk(e_2, he)

We can now see how reasoning with the content of the utterances along
with the aforementioned axioms allows the coherence of passage (18.71) to
be established under the Explanation relation. The derivation is summarized
in Figure 18.9; the sentence interpretations are shown in boxes. We start by
assuming there is a coherence relation, and using axiom (18.78) hypothesize
that this relation is Explanation,

(18.86)  Explanation(e_1, e_2)

which, by axiom (18.79), means we hypothesize that

(18.87)  cause(e_2, e_1)

holds. By axiom (18.83), we can hypothesize that there is an intermediate
cause e_3,

(18.88)  cause(e_2, e_3) ∧ cause(e_3, e_1)

and we can repeat this again by expanding the first conjunct of (18.88) to
have an intermediate cause e_4.

(18.89)  cause(e_2, e_4) ∧ cause(e_4, e_3)

We can take the hide predicate from the interpretation of the first sentence in
(18.84) and the second cause predicate in (18.88), and, using axiom (18.82),
hypothesize that John did not want Bill to have his car keys:

(18.90)  diswant(e_3, john, e_5) ∧ have(e_5, bill, ck)

From this, the carkeys predicate from (18.84), and the second cause
predicate from (18.89), we can use axiom (18.81) to hypothesize that John does
not want Bill to drive:

(18.91)  diswant(e_4, john, e_6) ∧ drive(e_6, bill)

From this, axiom (18.80), and the first cause predicate from (18.89), we
can hypothesize that Bill was drunk:

(18.92)  drunk(e_2, bill)

But now we find that we can ‘prove’ this fact from the interpretation of the
second sentence if we simply assume that the free variable he is bound to
Bill. Thus, the establishment of coherence has gone through, as we have
identified a chain of reasoning between the sentence interpretations (one
that includes unprovable assumptions about axiom choice and pronoun
assignment) that results in cause(e_2, e_1), as required for establishing the
Explanation relationship.
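
The chain of reasoning can be checked mechanically with a small sketch. Note what it does and does not do: it replays the derivation forwards, from the assumption that he is bound to Bill, using hand-instantiated ground versions of axioms (18.80) through (18.82) plus transitivity (18.83). It does not perform abductive search, unification, or cost computation. The event names e3 to e6 stand in for the existentially quantified variables, following the text, and the tuple encoding is an assumption of the example.

```python
# Ground facts are tuples: (predicate, arg1, arg2, ...).
SENTENCE_1 = {("hide", "e1", "john", "bill", "ck"), ("carkeys", "ck", "bill")}
SENTENCE_2 = {("drunk", "e2", "bill")}          # drunk(e_2, he), he := bill

# Hand-instantiated axioms as (antecedents, consequents).
GROUND_AXIOMS = [
    # (18.80): being drunk causes others not to want you to drive
    ({("drunk", "e2", "bill")},
     {("diswant", "e4", "john", "e6"), ("drive", "e6", "bill"),
      ("cause", "e2", "e4")}),
    # (18.81): not wanting someone to drive causes not wanting them
    # to have their car keys
    ({("diswant", "e4", "john", "e6"), ("drive", "e6", "bill")},
     {("diswant", "e3", "john", "e5"), ("have", "e5", "bill", "ck"),
      ("carkeys", "ck", "bill"), ("cause", "e4", "e3")}),
    # (18.82): not wanting someone to have something causes hiding it
    ({("diswant", "e3", "john", "e5"), ("have", "e5", "bill", "ck")},
     {("hide", "e1", "john", "bill", "ck"), ("cause", "e3", "e1")}),
]


def close(facts):
    """Apply the ground axioms and transitivity (18.83) to a fixed point."""
    facts = set(facts)
    while True:
        new = set()
        for antecedents, consequents in GROUND_AXIOMS:
            if antecedents <= facts:
                new |= consequents
        causes = [fact for fact in facts if fact[0] == "cause"]
        for _, a, b in causes:
            for _, c, d in causes:
                if b == c:
                    new.add(("cause", a, d))
        if new <= facts:
            return facts
        facts |= new


derived = close(SENTENCE_2)
explanation_holds = ("cause", "e2", "e1") in derived     # (18.87)
coherent = explanation_holds and SENTENCE_1 <= derived
print(explanation_holds, coherent)
```

If the drunk fact is replaced by something for which no axiom applies, nothing is derived and the check fails, which mirrors the situation with the spinach passage.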


This derivation illustrates a powerful property of coherence
establishment, namely its ability to cause the hearer to infer information about the
situation described by the discourse that the speaker has left unsaid. In this
case, the derivation required the assumption that John hid Bill’s keys
because he did not want him to drive (presumably out of fear of him having
an accident, or getting stopped by the police), as opposed to some other
explanation, such as playing a practical joke on him. This cause is not stated
anywhere in passage (18.71); it arises only from the inference process
triggered by the need to establish coherence. In this sense, the meaning of a
discourse is greater than the sum of the meanings of its parts. That is, a
discourse typically communicates far more information than is contained in the
interpretations of the individual sentences that comprise it.
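
Although costs are set aside in this discussion, the choice between competing explanations, such as not wanting Bill to drive versus a practical joke, can be pictured with a small sketch. The assumption names and cost values below are invented purely for illustration; they are not taken from Hobbs et al. (1993), whose system assigns and combines costs in a more elaborate way.

```python
# Each candidate explanation lists the assumptions it cannot prove,
# each with a hypothetical cost for making that assumption.
EXPLANATIONS = {
    "did not want Bill to drive": {
        "he is Bill": 1.0,
        "etc: drunk people should not drive": 2.0,
    },
    "practical joke": {
        "John plays practical jokes": 3.0,
        "Bill is the target": 2.0,
    },
}


def total_cost(assumptions):
    """Total cost of an explanation: the sum of its assumption costs."""
    return sum(assumptions.values())


def best_explanation(explanations):
    """Pick the explanation whose assumptions are cheapest overall."""
    return min(explanations, key=lambda name: total_cost(explanations[name]))


best = best_explanation(EXPLANATIONS)
print(best, total_cost(EXPLANATIONS[best]))
```

With different costs, for example in a context where John is known to be a prankster, the same procedure would select the other explanation.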

We now return to passage (18.72), repeated below as (18.94), which
was notable in that it lacks the coherence displayed by passage (18.71),
repeated below as (18.93).

(18.93)  John  hid  Bill’s  car  keys.  He  was  drunk. 

(18.94)  ??  John  hid  Bill’s  car  keys.  He  likes  spinach. 

We can now see why this is: there is no analogous chain of inference capable
of linking the two utterance representations. In particular, there is no causal
axiom analogous to (18.80) that says that liking spinach might cause
someone to not want you to drive. Without additional information that can
support such a chain of inference (such as the aforementioned scenario in which
someone promised John spinach in exchange for hiding Bill’s car keys), the
coherence of the passage cannot be established.

Because  abduction  is  a form  of  unsound  inference,  it  must  be  possible 
to  subsequently  retract  the  assumptions  made  during  abductive  reasoning, 
that is, abductive inferences are defeasible. For instance, if passage (18.93)
was  followed  by  sentence  (18.95), 

(18.95)  Bill’s car isn’t here anyway; John was just playing a practical joke
on him.

the  system  would  have  to  retract  the  original  chain  of  inference  connecting 
the  two  clauses  in  (18.93),  and  replace  it  with  one  utilizing  the  fact  that  the 
hiding  event  was  part  of  a practical  joke. 
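
The retraction step can be sketched in a few lines of Python. This is only an illustration under our own assumptions: literals are plain strings, and the table RULED_OUT_BY is a hypothetical stand-in for the inference that would decide which assumptions a new fact undermines. The sketch records which conclusions rest on which abduced assumptions, so that both can be withdrawn when sentence (18.95) arrives.

```python
# Hypothetical sketch: literals are plain strings, and RULED_OUT_BY is a
# hand-written table standing in for real inference over a knowledge base.
RULED_OUT_BY = {
    "practical_joke(John, Bill)": {"not_want(John, drive(Bill))"},
}


class Interpretation:
    def __init__(self):
        self.assumed = set()   # abduced literals that were never proven
        self.derived = {}      # conclusion -> set of assumptions it rests on

    def assume(self, literal):
        self.assumed.add(literal)

    def derive(self, conclusion, support):
        self.derived[conclusion] = set(support)

    def observe(self, fact):
        """Retract assumptions the new fact rules out, plus anything resting on them."""
        retracted = self.assumed & RULED_OUT_BY.get(fact, set())
        self.assumed -= retracted
        self.derived = {
            c: s for c, s in self.derived.items() if not (s & retracted)
        }
        return retracted


interp = Interpretation()
interp.assume("not_want(John, drive(Bill))")
interp.derive("cause(e2, e1)", {"not_want(John, drive(Bill))"})
retracted = interp.observe("practical_joke(John, Bill)")
```

After the call to observe, the causal link between the two clauses is no longer supported, and a new chain of inference based on the practical joke would have to be built in its place.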

In  a more  general  knowledge  base  designed  to  support  a broad  range 
of inferences, we would probably want axioms that are more general than
those  we  used  to  establish  the  coherence  of  passage  (18.93).  For  instance, 
consider  axiom  (18.81),  which  says  that  if  you  do  not  want  someone  to  drive, 
then  you  do  not  want  them  to  have  their  car  keys.  A more  general  form  of  the 
axiom  would  say  that  if  you  do  not  want  someone  to  perform  an  action,  and 
an  object  enables  them  to  perform  that  action,  then  you  do  not  want  them 
to  have  the  object.  The  fact  that  car  keys  enable  someone  to  drive  would 
then  be  encoded  separately,  along  with  many  other  similar  facts.  Likewise, 
axiom  (18.80)  says  that  if  someone  is  drunk,  you  don’t  want  them  to  drive. 
We  might  replace  this  with  an  axiom  that  says  that  if  someone  does  not  want 
something  to  happen,  then  they  don’t  want  something  that  will  likely  cause 
it  to  happen.  Again,  the  facts  that  people  typically  don’t  want  other  people 
to get into car accidents, and that drunk driving causes accidents, would be
encoded  separately. 
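
As a rough illustration of this factoring, the following Python sketch keeps two general rules apart from the world facts they operate on. The fact tables and function names are our own assumptions, chosen only to mirror the car-key example; they are not the axioms of this chapter.

```python
# Illustrative sketch only: the fact tables and names below are assumptions.

# World facts, encoded separately from the general rules.
ENABLES = {("car_keys", "drive")}                    # object enables action
LIKELY_CAUSES = {(("drunk", "drive"), "accident")}   # (state, action) likely causes outcome
UNWANTED_OUTCOMES = {"accident"}                     # outcomes people typically do not want


def unwanted_actions(states):
    """General rule: if an outcome is unwanted, so is an action likely to cause it."""
    return {
        action
        for (state, action), outcome in LIKELY_CAUSES
        if state in states and outcome in UNWANTED_OUTCOMES
    }


def unwanted_objects(actions):
    """General rule: if an action is unwanted, so is having an object that enables it."""
    return {obj for obj, action in ENABLES if action in actions}


actions = unwanted_actions({"drunk"})    # what we do not want a drunk person to do
objects = unwanted_objects(actions)      # what we do not want them to have
```

With these facts, the specific conclusions of axioms (18.80) and (18.81) fall out of the general rules, while a state such as liking spinach yields nothing because no fact connects it to an unwanted outcome.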

While  it  is  important  to  have  computational  models  that  shed  light  on 
the  coherence  establishment  problem,  large  barriers  remain  for  employing 
this and similar methods on a wide-coverage basis. In particular, the large
number of axioms that would be required to encode all of the necessary
facts about the world, and the lack of a robust mechanism for constraining
inference with such a large set of axioms, make these methods largely
impractical. Such problems have come to be informally known as
AI-complete, a play on the term NP-complete in computer science. An
AI-complete problem is one that essentially requires all of the knowledge - and
abilities  to  utilize  it  - that  humans  have. 

Other  approaches  to  analyzing  the  coherence  structure  of  a discourse 
have  also  been  proposed.  One  that  has  received  broad  usage  is  Rhetorical 
Structure  Theory  (RST)  (Mann  and  Thompson,  1987a),  which  proposes  a 
set  of  23  rhetorical  relations  that  can  hold  between  spans  of  text  within  a 
discourse.  While  RST  is  oriented  more  toward  text  description  than  inter- 
pretation, it  has  proven  to  be  a useful  tool  for  developing  natural  language 
generation  systems.  RST  is  described  in  more  detail  in  Section  20.4. 

Coherence  and  Coreference  The  reader  may  have  noticed  another  inter- 
esting property  of  the  proof  that  passage  (18.71)  is  coherent.  While  the 
pronoun  he  was  initially  represented  as  a free  variable,  it  got  bound  to  Bill 
during  the  derivation.  In  essence,  a separate  procedure  for  resolving  the 
pronoun  was  not  necessary;  it  happened  as  a side  effect  of  the  coherence  es- 
tablishment procedure.  In  addition  to  the  tree-search  algorithm  presented  on 
page  683,  Hobbs  (1978b)  proposes  this  use  of  the  coherence  establishment 
mechanism  as  a second  approach  to  pronoun  interpretation. 

This  approach  provides  an  explanation  for  why  the  pronoun  in  passage 
(18.71)  is  most  naturally  interpreted  as  referring  to  Bill,  but  the  pronoun  in 
passage  (18.96)  is  most  naturally  interpreted  as  referring  to  John. 

(18.96)  John  lost  Bill’s  car  keys.  He  was  drunk. 

Establishing  the  coherence  of  passage  (18.96)  under  Explanation  requires  an 
axiom  that  says  that  being  drunk  could  cause  someone  to  lose  something. 

Because  such  an  axiom  will  dictate  that  the  person  who  is  drunk  must  be 
the  same  as  the  person  losing  something,  the  free  variable  representing  the 
pronoun  will  become  bound  to  John.  The  only  lexico-syntactic  difference 
between  passages  (18.96)  and  (18.71),  however,  is  the  verb  of  the  first  sen- 
tence. The  grammatical  positions  of  the  pronoun  and  potential  antecedent 
noun phrases are the same in both cases, so syntactically-based preferences
do  not  distinguish  between  these. 
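
The contrast between (18.71) and (18.96) can be made concrete with a deliberately simplified Python sketch. Real abductive interpretation would bind the pronoun by unifying variables during a proof; here we assume a hypothetical table that records, for each verb, which participant of the first event must be the drunk person for the Explanation to go through. The role names and event dictionaries are illustrative assumptions.

```python
# Assumed causal knowledge: for each verb, the role of the first event that
# the drunk person must fill for an Explanation relation to be provable.
DRUNK_EXPLAINS = {
    "hide": "owner",   # y is drunk, so x hides y's keys (cf. axioms 18.80 and 18.81)
    "lose": "agent",   # x is drunk, so x loses something
}


def resolve_pronoun(event):
    """Bind the pronoun to whichever participant the causal axiom requires."""
    role = DRUNK_EXPLAINS.get(event["verb"])
    return None if role is None else event[role]


hid = {"verb": "hide", "agent": "John", "owner": "Bill"}
lost = {"verb": "lose", "agent": "John", "owner": "Bill"}
```

The two events differ only in their verb, yet the pronoun resolves to Bill for the hiding event and to John for the losing event, because a different causal axiom is needed in each case.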

Discourse  Connectives  Sometimes  a speaker  will  include  a specific  cue, 
called  a connective,  that  serves  to  constrain  the  set  of  coherence  relations 
that can hold between two or more utterances. For example, the connective
because indicates the Explanation relationship explicitly, as in passage
(18.97).

(18.97)  John  hid  Bill’s  car  keys  because  he  was  drunk. 

The meaning of because can be represented as cause(e2, e1), which would
play  a similar  role  in  the  proof  as  the  cause  predicate  that  was  introduced 
abductively  via  axiom  (18.79). 

However,  connectives  do  not  always  constrain  the  possibilities  to  a sin- 
gle coherence  relation.  The  meaning  of  and,  for  instance,  is  compatible  with 
the Parallel, Occasion, and Result relations introduced on page 690, as
exemplified in (18.98)-(18.100) respectively.

(18.98)  John  bought  an  Acura  and  Bill  leased  a BMW. 

(18.99)  John  bought  an  Acura  and  drove  to  the  ballgame. 

(18.100)  John bought an Acura and his father went ballistic.

However,  and  is  not  compatible  with  the  Explanation  relation;  unlike  pas- 
sage (18.97),  passage  (18.101)  cannot  mean  the  same  thing  as  (18.71). 

(18.101)  John hid Bill’s car keys and he was drunk.

While  the  coherence  resolution  procedure  can  use  connectives  to  con- 
strain the  range  of  coherence  relations  that  can  be  inferred  between  a pair  of 
utterances,  they  in  and  of  themselves  do  not  create  coherence.  Any  coher- 
ence relation  indicated  by  a connective  must  still  be  established.  Therefore, 
adding  because  to  example  (18.72),  for  instance,  still  does  not  make  it  co- 
herent. 

(18.102)  ??  John  hid  Bill's  car  keys  because  he  likes  spinach. 

Coherence  establishment  fails  here  for  the  same  reason  it  does  for  example 
(18.72),  that  is,  the  lack  of  causal  knowledge  explaining  how  liking  spinach 
would  cause  one  to  hide  someone’s  car  keys. 
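
The division of labor between connectives and inference can be summarized in a short Python sketch. The relation inventory is restricted to the four relations named in this section, and the set passed as `provable` is a hypothetical stand-in for whatever the inference procedure manages to establish; both are assumptions made for illustration.

```python
# Relations mentioned in this section; a fuller inventory is assumed elsewhere.
RELATIONS = {"Explanation", "Parallel", "Occasion", "Result"}

# Each connective narrows the set of relations that may be inferred.
CONNECTIVE_CONSTRAINTS = {
    "because": {"Explanation"},
    "and": {"Parallel", "Occasion", "Result"},
}


def candidate_relations(connective=None):
    """Relations still permitted once the connective (if any) is taken into account."""
    return CONNECTIVE_CONSTRAINTS.get(connective, RELATIONS)


def coherent(connective, provable):
    """A passage is coherent only if some permitted relation can actually be established."""
    return bool(candidate_relations(connective) & provable)
```

Under these assumptions, `coherent("because", {"Explanation"})` holds for (18.97), `coherent("and", {"Explanation"})` fails for (18.101), and `coherent("because", set())` fails for (18.102), since the connective alone establishes nothing.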

18.3  Discourse Structure


In  the  previous  section,  we  saw  how  the  coherence  of  a pair  of  sentences  can 
be  established.  We  now  ask  how  coherence  can  be  established  for  longer  dis- 
courses. Does  one  simply  establish  coherence  relations  between  all  adjacent 
pairs  of  sentences? 

It  turns  out  that  the  answer  is  no.  Just  as  sentences  have  hierarchical 
structure  (that  is,  syntax),  so  do  discourses.  Consider  passage  (18.103). 

(18.103)  • John went to the bank to deposit his paycheck. (S1)

• He  then  took  a train  to  Bill’s  car  dealership.  (S2) 

• He  needed  to  buy  a car.  (S3) 

• The  company  he  works  for  now  isn’t  near  any  public 
transportation.  (S4) 

• He  also  wanted  to  talk  to  Bill  about  their  softball  league.  (S5) 

Intuitively,  the  structure  of  passage  (18.103)  is  not  linear.  The  discourse 
seems  to  be  primarily  about  the  sequence  of  events  described  in  sentences 
S1 and S2, whereas sentences S3 and S5 are related most directly to S2, and
S4  is  related  most  directly  to  S3.  The  coherence  relationships  between  these 
sentences  result  in  the  discourse  structure  shown  in  Figure  18.10. 


Each  node  in  the  tree  represents  a group  of  locally  coherent  utterances, 
called  a discourse  segment.  Roughly  speaking,  one  can  think  of  discourse 
segments  as  being  analogous  to  intermediate  constituents  in  sentence  syntax. 

We  can  extend  the  set  of  discourse  interpretation  axioms  used  in  the 
last  section  to  establish  the  coherence  of  larger,  hierarchical  discourses  such 
as  (18.103).  The  recognition  of  discourse  segments,  and  ultimately  discourse 
structure,  results  as  a by-product  of  this  process. 

First, we add axiom (18.104), which states that a sentence is a discourse
segment. Here, w is the string of words in the sentence, and e the
event  or  state  described  by  it. 

(18.104)  (∀w, e) sentence(w, e) ⊃ Segment(w, e)

Next,  we  add  axiom  (18.105),  which  says  that  two  smaller  segments  can 
be  composed  into  a larger  one  if  a coherence  relation  can  be  established 
between  the  two. 

(18.105)  (∀w1, w2, e1, e2, e) Segment(w1, e1) ∧ Segment(w2, e2)
          ∧ CoherenceRel(e1, e2, e) ⊃ Segment(w1 w2, e)

Note  that  extending  our  axioms  for  longer  discourses  has  necessitated  that 
we  add  a third  argument  to  the  CoherenceRel  predicate  (e).  The  value  of 
this variable will be a combination of the information expressed by e1 and e2
that  represents  the  main  assertion  of  the  resulting  segment.  For  our  purposes 
here,  we  will  assume  that  subordinating  relations  such  as  Explanation  pass 
along  only  one  argument  (in  this  case  the  first,  that  is,  the  effect),  whereas 
coordinating  relations  such  as  Parallel  and  Occasion  pass  a combination 
of  both  arguments.  These  arguments  are  shown  in  parentheses  next  to  each 
relation  in  Figure  18.10. 

Now,  to  interpret  a coherent  text  W,  one  must  simply  prove  that  it  is  a 
segment,  as  expressed  by  statement  (18.106). 

(18.106)  (∃e) Segment(W, e)

These  two  rules  will  derive  any  possible  binary  branching  segmental  struc- 
ture for  a discourse,  as  long  as  that  structure  can  be  supported  by  the  estab- 
lishment of  coherence  relations  between  the  segments.  Herein  lies  a differ- 
ence between  computing  the  syntactic  structure  of  a sentence  (see  Chapter  9) 
and  that  of  a discourse.  Sentence-level  grammars  are  generally  complex,  en- 
coding many  syntactic  facts  about  how  different  constituents  (noun  phrases, 
verb phrases) can modify each other and in what order. The ‘discourse
grammar’ above, by contrast, is much simpler, encoding only two rules:
a segment  rewrites  to  two  smaller  segments,  and  a sentence  is  a segment. 
Which  of  the  possible  structures  is  actually  assigned  depends  on  how  the 
coherence  of  the  passage  is  established. 
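
To make the two-rule ‘discourse grammar’ concrete, here is a minimal Python sketch that enumerates every binary segment structure licensed by rules (18.104) and (18.105). The function `coherence_rel` is a hypothetical oracle standing in for the abductive inference of the previous section: it returns the event passed up by the relation, or None when no relation can be established. The table of relations for passage (18.103) is our own reading of the description above, not output from a real system.

```python
from functools import lru_cache


def parse_discourse(sentences, coherence_rel):
    """Return every (tree, event) pair licensed for the whole discourse.

    sentences: list of (label, event) pairs, one per sentence.
    coherence_rel(e1, e2): event of the combined segment, or None.
    """
    @lru_cache(maxsize=None)
    def segments(i, j):
        # Rule (18.104): a sentence is a segment.
        if j - i == 1:
            return (sentences[i],)
        found = []
        # Rule (18.105): two adjacent segments combine if a relation holds.
        for k in range(i + 1, j):
            for left_tree, e1 in segments(i, k):
                for right_tree, e2 in segments(k, j):
                    e = coherence_rel(e1, e2)
                    if e is not None:
                        found.append(((left_tree, right_tree), e))
        return tuple(found)

    return list(segments(0, len(sentences)))


# Assumed relations for passage (18.103): Explanation passes up its first
# argument, while Parallel and Occasion pass up a combination of both.
RELATION_TABLE = {
    ("e3", "e4"): "e3",        # Explanation(S3, S4)
    ("e3", "e5"): "e3;e5",     # Parallel(S3-S4, S5)
    ("e2", "e3;e5"): "e2",     # Explanation(S2, S3-S5)
    ("e1", "e2"): "e1;e2",     # Occasion(S1, S2-S5)
}

passage = [("S1", "e1"), ("S2", "e2"), ("S3", "e3"), ("S4", "e4"), ("S5", "e5")]
trees = parse_discourse(passage, lambda a, b: RELATION_TABLE.get((a, b)))
```

Although the grammar itself allows many bracketings of five sentences, only the structures supported by the oracle survive; with the table above a single tree remains, in which S4 attaches to S3, that pair combines with S5, the result attaches to S2, and S1 combines with everything that follows.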

Why  would  we  want  to  compute  discourse  structure?  Several  appli- 
cations could  benefit  from  it.  A summarization  system,  for  instance,  might 
use  it  to  select  only  the  central  sentences  in  the  discourse,  forgoing  the  in- 
clusion of  subordinate  information.  For  instance,  a system  for  creating  brief 
summaries might only include sentences S1 and S2 when applied to
passage (18.103), since the event representations for these were propagated to
the  top  level  node.  A system  for  creating  more  detailed  summaries  might 
also  include  S3  and  S5.  Similarly,  an  information  retrieval  system  might 
weight information in sentences that are propagated to higher-level parts of
the discourse structure more heavily than information in ones that are not,
and  generation  systems  need  knowledge  of  discourse  structure  to  create  co- 
herent discourse,  as  described  in  Chapter  20. 
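
The summarization idea can be sketched as follows, assuming the discourse tree for passage (18.103) has already been recovered. The tuple encoding of the tree, the relation labels, and the notion of ‘demotion depth’ (the number of subordinate positions above a sentence) are our own illustrative assumptions.

```python
# Subordinating relations pass along only their first argument; we assume
# Explanation is the only such relation in this example.
SUBORDINATING = {"Explanation"}


def demotion_depths(tree, depth=0):
    """Map each sentence label to the number of subordinate positions above it."""
    if isinstance(tree, str):
        return {tree: depth}
    relation, left, right = tree
    depths = demotion_depths(left, depth)
    right_depth = depth + 1 if relation in SUBORDINATING else depth
    depths.update(demotion_depths(right, right_depth))
    return depths


def summarize(tree, max_depth=0):
    """Keep the sentences demoted at most max_depth times."""
    depths = demotion_depths(tree)
    return sorted(s for s, d in depths.items() if d <= max_depth)


# Assumed structure for passage (18.103), following the description in the text.
TREE = ("Occasion", "S1",
        ("Explanation", "S2",
         ("Parallel", ("Explanation", "S3", "S4"), "S5")))
```

With this tree, `summarize(TREE)` keeps only S1 and S2, while `summarize(TREE, 1)` adds S3 and S5, matching the brief and more detailed summaries described above.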

Discourse  structure  may  also  be  useful  for  natural  language  subtasks 
such  as  pronoun  resolution.  We  already  know  from  Section  18.1  that  pro- 
nouns display  a preference  for  recency,  that  is,  they  have  a strong  tendency 
to  refer  locally.  But  now  we  have  two  possible  definitions  for  recency:  re- 
cent in  terms  of  the  linear  order  of  the  discourse,  or  recent  in  terms  of  its 
hierarchical  structure.  It  has  been  claimed  that  the  latter  definition  is  in  fact 
the correct one, although admittedly the facts are not completely clear in all
cases. 

In  this  section,  we  have  briefly  described  one  of  several  possible  ap- 
proaches to  recovering  discourse  structure.  A different  approach,  one  typi- 
cally applied  to  dialogues,  will  be  described  in  Section  19.4. 

18.4  Psycholinguistic  Studies  of  Reference  and 
Coherence 


To  what  extent  do  the  techniques  described  in  this  chapter  model  human 
discourse  comprehension?  A substantial  body  of  psycholinguistic  research 
has  studied  this  question. 

For  instance,  a significant  amount  of  work  has  been  concerned  with 
the  extent  to  which  people  use  the  preferences  described  in  Section  18.1  to 
interpret pronouns, the results of which are often contradictory. Clark and
Sengul (1979) studied the role that sentence recency plays in pronoun
interpretation using a set of reading time experiments. After receiving and
acknowledging a three-sentence context to read, human subjects were given
a target  sentence  containing  a pronoun.  The  subjects  pressed  a button  when 
they felt that they understood the target sentence. Clark and Sengul found
that  the  reading  time  was  significantly  faster  when  the  referent  for  the  pro- 
noun was  evoked  from  the  most  recent  clause  in  the  context  than  when  it 
was  evoked  from  two  or  three  clauses  back.  On  the  other  hand,  there  was  no 
significant  difference  between  referents  evoked  from  two  clauses  and  three 
clauses back, leading them to claim that “the last clause processed grants the
entities  it  mentions  a privileged  place  in  working  memory”. 

Crawley  et  al.  (1990)  compared  the  grammatical  role  parallelism  pref- 
erence with  a grammatical  role  preference,  in  particular,  a preference  for  ref- 
erents evoked  from  the  subject  position  of  the  previous  sentence  over  those 
evoked  from  object  position.  Unlike  previous  studies  which  conflated  these 
preferences  by  considering  only  subject-to-subject  reference  effects,  Craw- 
ley et  al.  studied  pronouns  in  object  position  to  see  if  they  tended  to  be  as- 
signed to  the  subject  or  object  of  the  last  sentence.  They  found  that  in  two 
task  environments  - a question  answering  task  which  revealed  how  the  hu- 
man subjects  interpreted  the  pronoun,  and  a referent  naming  task  in  which 
the  subjects  identified  the  referent  of  the  pronoun  directly  - the  human  sub- 
jects resolved  pronouns  to  the  subject  of  the  previous  sentence  more  often 
than  the  object. 

However,  Smyth  (1994)  criticized  the  adequacy  of  Crawley  et  al.’s  data 
for  evaluating  the  role  of  parallelism.  Using  data  that  met  more  stringent  re- 
quirements for  assessing  parallelism,  Smyth  found  that  subjects  overwhelm- 
ingly followed  the  parallelism  preference  in  a referent  naming  task.  The 
experiment  supplied  weaker  support  for  the  preference  for  subject  referents 
over object referents, which he posited as a default strategy when the
sentences in question are not sufficiently parallel.

Caramazza  et  al.  (1977)  studied  the  effect  of  the  ‘implicit  causality’ 
of  verbs  on  pronoun  resolution.  Verbs  were  categorized  in  terms  of  having 
subject  bias  or  object  bias  using  a sentence  completion  task.  Subjects  were 
given  sentence  fragments  such  as  (18.107). 

(18.107)  John  telephoned  Bill  because  he 

The  subjects  provided  completions  to  the  sentences,  which  identified  to  the 
experimenters  what  referent  for  the  pronoun  they  favored.  Verbs  for  which 
a large  percentage  of  human  subjects  indicated  a grammatical  subject  or  ob- 
ject preference  were  categorized  as  having  that  bias.  A sentence  pair  was 
then  constructed  for  each  biased  verb:  a ‘congruent’  sentence  in  which  the 
semantics  supported  the  pronoun  assignment  suggested  by  the  verb’s  bias, 
and  an  ‘incongruent’  sentence  in  which  the  semantics  supported  the  opposite 
prediction.  For  example,  sentence  (18.108)  is  congruent  for  the  subject-bias 
verb  ‘telephoned’,  since  the  semantics  of  the  second  clause  supports  assign- 
ing the  subject  John  as  the  antecedent  of  he,  whereas  sentence  (18.109)  is 
incongruent  since  the  semantics  supports  assigning  the  object  Bill. 

(18.108)  John  telephoned  Bill  because  he  wanted  some  information. 

(18.109)  John telephoned Bill because he withheld some information.


In  a referent  naming  task,  Caramazza  et  al.  found  that  naming  times  were 
faster  for  the  congruent  sentences  than  for  the  incongruent  ones.  Perhaps 
surprisingly,  this  was  even  true  for  cases  in  which  the  two  people  mentioned 
in  the  first  clause  were  of  different  genders  (e.g.,  change  John  to  Sue  in 
examples  (18.108)  and  (18.109)),  thus  rendering  the  reference  unambiguous. 

Garnham  et  al.  (1996)  differentiated  between  two  hypotheses  about  the 
manner  in  which  implicit  causality  might  affect  pronoun  resolution:  the  fo- 
cus hypothesis,  which  says,  as  might  be  suggested  by  the  Caramazza  et  al. 
experiments,  that  such  verbs  have  a priming  effect  on  the  filler  of  a particu- 
lar grammatical  role  and  thus  contribute  information  that  can  be  used  at  the 
point  at  which  the  pronoun  is  interpreted,  and  the  integration  hypothesis,  in 
which  this  information  is  only  used  after  the  clause  has  been  comprehended 
and  is  being  integrated  with  the  previous  discourse.  They  attempted  to  de- 
termine which  hypothesis  is  correct  using  a probing  task.  After  sentences 
were  presented  to  establish  a context,  a sentence  containing  a pronoun  was 
presented one word at a time. At appropriate points during the
presentation, the name of one of the possible referents was displayed, and the
subject was asked whether that person had been mentioned in the sentence so
far. Garnham et al. found that information about implicit causality bias was generally
not  available  right  after  the  pronoun  was  given,  but  was  utilized  later  in  the 
sentence. 

Matthews and Chodorow (1988) analyzed the problem of intrasentential
reference and the predictions of syntactically-based search strategies. In
a question  answering  task,  they  found  that  subjects  exhibited  slower  com- 
prehension times  for  sentences  in  which  a pronoun  antecedent  occupied  an 
early,  syntactically  deep  position  than  for  sentences  in  which  the  antecedent 
occupied  a late,  syntactically  shallow  position.  This  result  is  consistent  with 
the  search  process  used  in  Hobbs’s  tree  search  algorithm. 

There  has  also  been  psycholinguistic  work  concerned  with  testing  the 
principles  of  centering  theory.  In  a set  of  reading  time  experiments,  Gor- 
don et  al.  (1993)  found  that  reading  times  were  slower  when  the  current 
backward-looking  center  was  referred  to  using  a full  noun  phrase  instead 
of  a pronoun,  even  though  the  pronouns  were  ambiguous  and  the  proper 
names  were  not.  This  effect  - which  they  called  a repeated  name  penalty 
- was found only for referents in subject position, suggesting that the Cb is
preferentially  realized  as  a subject.  Brennan  (1995)  analyzed  how  choice 
of linguistic form correlates with centering principles. She ran a set of
experiments in which a human subject watched a basketball game and had to
describe  it  to  a second  person.  She  found  that  the  human  subjects  tended  to 
refer  to  an  entity  using  a full  noun  phrase  in  subject  position  before  subse- 
quently pronominalizing  it,  even  if  the  referent  had  already  been  introduced 
in  object  position. 

Psycholinguistic  studies  have  also  addressed  the  processes  people  use 
to  establish  discourse  coherence.  Some  of  this  work  has  focussed  on  the 
question of inference control, that is, which of the potentially infinite
number of possible inferences are actually made during interpretation (Singer,
1994; Garrod and Sanford, 1994). These can be categorized in terms of
being necessary inferences, those which are necessary to establish coherence,
and elaborative inferences, those which are suggested by the text but not
necessary for establishing coherence. The position that only necessary
inferences are made during interpretation has been called the deferred inference
theory (Garnham, 1985) and the minimalist position (McKoon and Ratcliff,
1992). As with pronoun interpretation, studies testing these
questions have yielded potentially contradictory results. Indeed, the results in
each  case  depend  to  a large  degree  on  the  experimental  setup  and  paradigm 
(Keenan  et  al.,  1990). 

Johnson  et  al.  (1973),  for  instance,  examined  this  question  using  a 
recognition  judgement  task.  They  presented  subjects  with  passages  such 
as  (18.110). 

(18.110)  When the man entered the kitchen he slipped on a wet spot and
dropped the delicate glass pitcher on the floor. The pitcher was very
expensive, and everyone watched the event with horror.

The  subjects  were  subsequently  presented  either  with  a sentence  taken  di- 
rectly from  one  of  the  passages,  such  as  the  first  sentence  of  (18.110),  or 
one  that  included  an  elaborative  inference  in  the  form  of  an  expected  conse- 
quence such  as  (18.111).  The  subjects  were  then  asked  if  the  sentence  had 
appeared  verbatim  in  one  of  the  passages. 

(18.111)  The man broke the delicate glass pitcher on the floor.

Both  types  of  sentence  received  a recognition  rate  in  the  mid-60%  range, 
whereas  control  sentences  that  substantially  altered  the  meaning  were  rec- 
ognized much  less  often  (about  22%).  By  running  a similar  experiment  that 
also  measured  subjects’  response  times,  Singer  (1979)  addressed  the  ques- 
tion of  whether  these  inferences  were  made  at  the  time  the  original  sentence 
was  comprehended  (and  thus  truly  elaborative),  or  at  the  time  that  the  ex- 
pected consequence  version  was  presented.  While  Singer  also  found  that  the 
identical and expected consequence versions yield similar rates of positive
responses,  the  judgements  about  the  consequence  versions  took  0.2-0.3  sec- 
onds longer  than  for  the  identical  sentences,  suggesting  that  the  inference 
was  not  made  at  comprehension  time. 

Singer (1980) examined the question of when different types of
inferences were made using passages such as (18.112)-(18.114).

(18.112)  The dentist pulled the tooth painlessly. The patient liked the new
method. 

(18.113)  The  tooth  was  pulled  painlessly.  The  dentist  used  a new  method. 

(18.114)  The tooth was pulled painlessly. The patient liked the new
method. 

Each  of  these  passages  was  presented  to  the  subject,  followed  by  the  test 
sentence  given  in  (18.115). 

(18.115)  A dentist pulled the tooth.

The information expressed in (18.115) is mentioned explicitly in (18.112), is
necessary  to  establish  coherence  in  (18.113),  and  is  elaborative  in  (18.114). 

Singer  found  that  subject  verification  times  were  approximately  the  same  in 
the  first  two  cases,  but  0.25  seconds  slower  in  the  elaborative  case,  adding 
support  to  the  deferred  inference  theory. 

Kintsch  and  colleagues  have  proposed  and  analyzed  a ‘construction- 
integration’  model  of  discourse  comprehension  (Kintsch  and  van  Dijk,  1978; 
van Dijk and Kintsch, 1983; Kintsch, 1988). They defined the concept of a
text macrostructure, which is a hierarchical network of propositions that
provides an abstract, semantic description of the global content of the text.

Guindon  and  Kintsch  (1984)  evaluated  whether  the  elaborative  inferences 
necessary to construct the macrostructure accompany comprehension
processes, using a lexical priming technique. Subjects read a passage and then
were asked if a particular word pair was present in the text. Three types of
word  pairs  were  used:  pairs  that  were  not  mentioned  in  the  text  but  were 
related  to  the  text  macrostructure,  pairs  of  ‘distractor  words’  that  were  the- 
matically related  to  the  text  but  not  the  macrostructure,  and  pairs  of  themat- 
ically unrelated  distractor  words.  The  number  of  ‘false  alarms’  - in  which  a 
subject  erroneously  indicated  that  the  words  appeared  in  the  text  - was  sig- 
nificantly higher  for  macrostructure  pairs  than  for  thematically  related  pairs, 
which  in  turn  was  higher  than  for  pairs  of  thematically  unrelated  words.  In 
the remaining cases - in which the subjects correctly rejected word pairs that
did not appear - response times were significantly longer for macrostructure
pairs than for thematically related pairs, which in turn were longer than for
thematically unrelated pairs.

Myers  et  al.  (1987)  considered  the  question  of  how  the  degree  of  causal 
relatedness between sentences affects comprehension times and recall
accuracy. Considering a target sentence such as (18.116),

(18.116)  She found herself too frightened to move.

they designed four context sentences, shown in (18.117)-(18.120), which
form a continuum moving from high to low causal relatedness to (18.116).

(18.117)  Rose was attacked by a man in her apartment.

(18.118)  Rose saw a shadow at the end of the hall.

(18.119)  Rose entered her apartment to find a mess.

(18.120)  Rose  came  back  to  her  apartment  after  work. 

Subjects  were  presented  with  cause-effect  sentence  pairs  consisting  of  a con- 
text sentence  and  the  target  sentence.  Myers  et  al.  found  that  reading  times 
were  faster  for  more  causally  related  pairs.  After  the  subjects  had  seen  a 
number of such pairs, Myers et al. then ran a cued recall experiment, in
which  the  subjects  were  given  one  sentence  from  a pair  and  asked  to  recall 
as  much  as  possible  about  the  other  sentence  in  the  pair.  They  found  that  the 
subjects  recalled  more  content  for  more  causally  related  sentence  pairs. 


18.5  Summary 

In  this  chapter,  we  saw  that  many  of  the  problems  that  natural  language  pro- 
cessing systems  face  operate  between  sentences,  that  is,  at  the  discourse 
level.  Here  is  a summary  of  some  of  the  main  points  we  discussed: 

• Discourse  interpretation  requires  that  one  build  an  evolving  represen- 
tation of  discourse  state,  called  a discourse  model,  that  contains  repre- 
sentations of  the  entities  that  have  been  referred  to  and  the  relationships 
in  which  they  participate. 

• Natural  languages  offer  many  ways  to  refer  to  entities.  Each  form  of 
reference  sends  its  own  signals  to  the  hearer  about  how  it  should  be 
processed  with  respect  to  her  discourse  model  and  set  of  beliefs  about 
the  world. 

• Pronominal  reference  can  be  used  for  referents  that  have  an  adequate 
degree of salience in the discourse model. There are a variety of
lexical, syntactic, semantic, and discourse factors that appear to affect
salience.

• These  factors  can  be  modeled  and  weighed  against  each  other  in  a pro- 
noun interpretation  algorithm,  due  to  Lappin  and  Leass  (1994),  that 
achieves  performance  in  the  mid-80%  range  on  some  genres. 

• Discourses  are  not  arbitrary  collections  of  sentences;  they  must  be  co- 
herent. Collections  of  well-formed  and  individually  interpretable  sen- 
tences often  form  incoherent  discourses  when  juxtaposed. 

• The process of establishing coherence, performed by applying the
constraints imposed by one or more coherence relations, often leads to the
inference  of  additional  information  left  unsaid  by  the  speaker.  The 
unsound  rule  of  logical  abduction  can  be  used  for  performing  such  in- 
ference. 

• Discourses,  like  sentences,  have  hierarchical  structure.  Intermediate 
groups  of  locally  coherent  utterances  are  called  discourse  segments. 
Discourse  structure  recognition  can  be  viewed  as  a by-product  of  dis- 
course interpretation. 


Bibliographical  and  Historical  Notes 

Building  on  the  foundations  set  by  early  systems  for  natural  language  under- 
standing (Woods  et  al.,  1972;  Winograd,  1972b;  Woods,  1978),  much  of  the 
fundamental  work  in  computational  approaches  to  discourse  was  performed 
in the late 1970s. Webber’s (1978, 1983) work provided fundamental insights
into  how  entities  are  represented  in  the  discourse  model  and  the  ways  in 
which  they  can  license  subsequent  reference.  Many  of  the  examples  she  pro- 
vided continue  to  challenge  theories  of  reference  to  this  day.  Grosz  (1977b) 
addressed  the  focus  of  attention  that  conversational  participants  maintain  as 
the discourse unfolds. She defined two levels of focus: entities relevant to
the  entire  discourse  were  said  to  be  in  global  focus,  whereas  entities  that  are 
locally  in  focus  (i.e.,  most  central  to  a particular  utterance)  were  said  to  be 
in  immediate  focus.  Sidner  (1979,  1983b)  described  a method  for  tracking 
(immediate)  discourse  foci  and  their  use  in  resolving  pronouns  and  demon- 
strative noun  phrases.  She  made  a distinction  between  the  current  discourse 
focus  and  potential  foci,  which  are  the  predecessors  to  the  backward  and 
forward looking centers of centering theory respectively.

The roots of the centering approach originate from papers by Joshi and
Kuhn  (1979)  and  Joshi  and  Weinstein  (1981),  who  addressed  the  relation- 
ship between  immediate  focus  and  the  inferences  required  to  integrate  the 
current  utterance  into  the  discourse  model.  Grosz  et  al.  (1983)  integrated 
this  work  with  the  prior  work  of  Sidner  and  Grosz.  This  led  to  a manuscript 
on  centering  which,  while  widely  circulated  since  1986,  remained  unpub- 
lished until  Grosz  et  al.  (1995).  A series  of  papers  on  centering  based  on  this 
manuscript/paper were subsequently published (Kameyama, 1986; Brennan
et al., 1987; Di Eugenio, 1990; Walker et al., 1994; Di Eugenio, 1996; Strube
and Hahn, 1996; Kehler, 1997a, inter alia). A collection of more recent
centering papers appears in Walker et al. (1998).

Researchers  in  the  linguistics  community  have  proposed  accounts  of 
the  information  status  that  referents  hold  in  a discourse  model  (Chafe,  1976; 
Prince, 1981; Ariel, 1990; Prince, 1992; Gundel et al., 1993; Lambrecht,
1994, inter alia). Prince (1992), for instance, analyzes information status
in terms of two crosscutting dichotomies: hearer status and discourse status,
and shows how these statuses correlate with the grammatical position
of referring expressions. Gundel et al. (1993), on the other hand, posit a
unidimensional scale with six statuses (called the givenness hierarchy), and
correlate them with the linguistic form of referring expressions.

Beginning  with  Hobbs’s  (1978b)  tree-search  algorithm,  researchers 
have  pursued  syntax-based  methods  for  identifying  reference  robustly  in  nat- 
urally occurring  text.  Building  on  the  work  of  Lappin  and  Leass  (1994), 
Kennedy  and  Boguraev  (1996)  describe  a similar  system  that  does  not  rely 
on  a full  syntactic  parser,  but  merely  a mechanism  for  identifying  noun 
phrases  and  labeling  their  grammatical  roles.  Both  approaches  use  Alshawi’s 
(1987)  framework  for  integrating  salience  factors.  An  algorithm  that  uses 
this  framework  for  resolving  references  in  a multimodal  (i.e.,  speech  and 
gesture)  human-computer  interface  is  described  in  Huls  et  al.  (1995).  A dis- 
cussion of  a variety  of  approaches  to  reference  in  operational  systems  can  be 
found  in  Mitkov  and  Boguraev  (1997). 

Recently, several researchers have pursued methods for reference resolution
based on supervised learning (Connolly et al., 1994; Aone and Bennett,
1995; McCarthy and Lehnert, 1995; Kehler, 1997b; Ge et al., 1998,
inter alia). In these studies, machine learning methods such as Bayesian
model induction, decision trees, and maximum entropy modeling were used
to train models from corpora annotated with coreference relations. A discussion
of some issues that arise in annotating corpora for coreference can be
found in Poesio and Vieira (1998).

The MUC-6 information extraction evaluation included a common evaluation
on coreference (Sundheim, 1995a). The task included coreference
between proper names, aliases, definite noun phrases, bare nouns, pronouns,
and even coreference indicated by syntactic relations such as predicate nominals
(“The Integra is the world’s nicest looking car”) and appositives (“the
Integra, the world’s nicest looking car,”). Performance was evaluated by
calculating recall and precision statistics based on the distance between the
equivalence classes of coreferent descriptions produced by a system and
those in a human-annotated answer key. Five of the seven sites which participated
in the evaluation achieved in the range of 51%-63% recall and 62%-72%
precision. A similar evaluation was also included as part of MUC-7.
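
One way to make scoring by equivalence classes concrete is the link-based scheme often associated with MUC-style coreference scoring: recall counts how many of the links needed to build each key class survive in the response, and precision swaps the roles of key and response. The sketch below is only illustrative. The function names and the toy mention identifiers are hypothetical, it assumes that response classes do not overlap, and nothing in the text above confirms that it matches the official scorer in every detail.

```python
def link_recall(key, response):
    """Link-based recall: key and response are lists of sets of mention ids.

    For each key class S, count the pieces p(S) that the response splits it
    into (mentions missing from the response count as singleton pieces).
    Recall = sum(|S| - |p(S)|) / sum(|S| - 1).
    Assumes the response classes are pairwise disjoint.
    """
    numerator = 0
    denominator = 0
    for key_class in key:
        pieces = 0
        covered = set()
        for response_class in response:
            overlap = key_class & response_class
            if overlap:
                pieces += 1
                covered |= overlap
        pieces += len(key_class - covered)  # unresolved mentions
        numerator += len(key_class) - pieces
        denominator += len(key_class) - 1
    return numerator / denominator if denominator else 0.0


def link_precision(key, response):
    """Precision is recall with the roles of key and response exchanged."""
    return link_recall(response, key)


# Toy example: the key has one four-mention class, the system split it in two.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
recall = link_recall(key, response)        # 2 of the 3 needed links are found
precision = link_precision(key, response)  # every proposed link is correct
```

Splitting one class into two costs a single link, which is why recall in the toy example drops by only one third while precision is unaffected.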

Several  researchers  have  posited  sets  of  coherence  relations  that  can 
hold  between  utterances  in  a discourse  (Halliday  and  Hasan,  1976;  Hobbs, 
1979a;  Longacre,  1983;  Mann  and  Thompson,  1987a;  Polanyi,  1988;  Hobbs, 
1990;  Sanders  et  al.,  1992,  inter  alia).  A compendium  of  over  350  rela- 
tions that  have  been  proposed  in  the  literature  can  be  found  in  Hovy  (1990). 
The  Linguistic  Discourse  Model  (Polanyi,  1988;  Scha  and  Polanyi,  1988) 
is  a framework  in  which  discourse  syntax  is  more  heavily  emphasized;  in 
this  approach,  a discourse  parse  tree  is  built  on  a clause-by-clause  basis  in 
direct  analogy  with  how  a sentence  parse  tree  is  built  on  a constituent-by- 
constituent  basis.  A more  recent  line  of  work  has  applied  a version  of  the 
tree-adjoining grammar formalism to discourse parsing (Webber et al., 1999,
and  citations  therein).  In  addition  to  determining  discourse  structure  and 
meaning,  theories  of  discourse  coherence  have  been  used  in  algorithms  for 
interpreting discourse-level linguistic phenomena, including pronoun resolution
(Hobbs, 1979a; Kehler, 2000), verb phrase ellipsis and gapping (Prüst,
1992; Asher, 1993; Kehler, 1993, 1994a), and tense interpretation (Lascarides
and Asher, 1993; Kehler, 1994b, 2000). An extensive investigation
into  the  relationship  between  coherence  relations  and  discourse  connectives 
can  be  found  in  Knott  and  Dale  (1994). 


Exercises 


18.1  Early  work  in  syntactic  theory  attempted  to  characterize  rules  for 
pronominalization  through  purely  syntactic  means.  A rule  was  proposed  in 
which  a pronoun  was  interpreted  by  deleting  it  from  the  syntactic  structure 
of the sentence that contains it, and replacing it with the syntactic representation
of the antecedent noun phrase.

Explain  why  the  following  sentences  (called  “Bach-Peters”  sentences) 
are problematic for such an analysis.

(18.121)  The  man  who  deserves  it  gets  the  prize  he  wants. 

(18.122)  The  pilot  who  shot  at  it  hit  the  MIG  that  chased  him. 

What  other  types  of  reference  discussed  on  pages  667-672  are  problematic 
for  this  type  of  analysis? 

Now,  consider  the  following  example  (Karttunen,  1969). 

(18.123)  The  student  who  revised  his  paper  did  better  than  the  student  who 
handed  it  in  as  is. 

What  is  the  preferred  reading  for  the  pronoun  it,  and  why  is  it  different  and 
interesting?  Describe  why  the  syntactic  account  described  above  can  be  seen 
to  predict  this  reading.  Is  this  type  of  reading  common?  Construct  some 
superficially similar examples that nonetheless appear not to have a similar
reading. 

18.2 Webber (1978) offers examples in which the same referent appears to
support  either  singular  or  plural  agreement: 

(18.124)  John  gave  Mary  five  dollars.  It  was  more  than  he  gave  Sue. 

(18.125)  John  gave  Mary  five  dollars.  One  of  them  was  counterfeit. 

What  might  account  for  this?  Describe  how  representations  of  referents  like 
five  dollars  in  the  discourse  model  could  be  made  to  allow  such  behavior. 

Next,  consider  the  following  examples  (from  Webber  and  Baldwin 
(1992)): 

(18.126)  John  made  a handbag  from  an  inner  tube. 

a.  He  sold  it  for  twenty  dollars. 

b.  He  had  taken  it  from  his  brother’s  car. 

c.  Neither  of  them  was  particularly  useful. 

d.  * He  sold  them  for  fifty  dollars. 

Why is plural reference to the handbag and the inner tube possible in sentence
(18.126c), but not (18.126d)? Again, discuss how representations in
the  discourse  model  could  be  made  to  support  this  behavior. 

18.3 Draw syntactic trees for example (18.68) on page 681 and apply Hobbs’s
tree search algorithm to it, showing each step in the search.

18.4 Recall that Hobbs’s algorithm does not have an explicit representation
of a discourse model, salience, or preferences. Discuss which of the
preferences we have described are approximated by the search process over
syntactic  representations  as  Hobbs  has  defined  it,  and  how. 

18.5  Hobbs  (1977)  cites  the  following  examples  from  his  corpus  as  being 
problematic  for  his  tree-search  algorithm. 

(18.127)  The  positions  of  pillars  in  one  hall  were  marked  by  river  boulders 
and  a shaped  convex  cushion  of  bronze  that  had  served  as  their 
footings. 

(18.128)  They  were  at  once  assigned  an  important  place  among  the  scanty 
remains  which  record  the  physical  developments  of  the  human  race 
from  the  time  of  its  first  appearance  in  Asia. 

(18.129)  Sites  at  which  the  coarse  grey  pottery  of  the  Shang  period  has 
been  discovered  do  not  extend  far  beyond  the  southernmost  reach  of 
the  Yellow  river,  or  westward  beyond  its  junction  with  the  Wei. 

(18.130)  The  thin,  hard,  black-burnished  pottery,  made  in  shapes  of  angular 
profile,  which  archeologists  consider  as  the  clearest  hallmark  of  the 
Lung  Shan  culture,  developed  in  the  east.  The  site  from  which  it  takes 
its name is in Shantung. It is traced to the north-east as far as
Liao-ning  province. 

(18.131)  He  had  the  duty  of  performing  the  national  sacrifices  to  heaven 
and  earth:  his  role  as  source  of  honours  and  material  rewards  for 
services  rendered  by  feudal  lords  and  ministers  is  commemorated  in 
thousands  of  inscriptions  made  by  the  recipients  on  bronze  vessels 
which  were  eventually  deposited  in  their  graves. 

In  each  case,  identify  the  correct  referent  of  the  underlined  pronoun  and  the 
one  that  the  algorithm  will  incorrectly  identify.  Discuss  any  factors  that  come 
into  play  in  determining  the  correct  referent  in  each  case,  and  what  types  of 
information  might  be  necessary  to  account  for  them. 

18.6  Consider  the  following  passage,  from  Brennan  et  al.  (1987): 

(18.132)  Brennan  drives  an  Alfa  Romeo. 

She  drives  too  fast. 

Friedman  races  her  on  weekends. 

She  goes  to  Laguna  Seca. 

Identify  the  referent  that  the  BFP  algorithm  finds  for  the  pronoun  in  the  final 
clause. Do you agree with this choice, or do you find the example ambiguous?
Discuss why introducing a new noun phrase in subject position, with
a pronominalized reference in object position, might lead to an ambiguity.
What preferences are competing here?

18.7 The approaches to pronoun resolution discussed in this chapter depend
on accurate parsing: Hobbs’s tree search algorithm assumes a full
syntactic tree, and Lappin and Leass’s algorithm and centering require that
grammatical roles are assigned correctly. Given the current state of the art
in syntactic processing, highly accurate syntactic structures are currently not
reliably computable. Therefore, real-world algorithms must choose between
two options: (i) use a parser to generate (often inaccurate) syntactic
analyses and use them as such, or (ii) eschew full syntactic analysis altogether
and base the algorithm on partial syntactic analysis, such as noun
phrase recognition. The Lappin and Leass system took the first option, using
a highly developed parser. However, one could take the second option,
and augment their algorithm so that surface position is used to approximate
a grammatical role hierarchy.

Design a set of preferences for the Lappin and Leass method that assumes
that only noun phrases are bracketed in the input. Construct six examples:
(i) two that are handled by both methods, (ii) two examples that Lappin
and Leass handle but that are not handled by your adaptation, and (iii) two
that are not handled correctly by either algorithm. Make sure the examples
are nontrivially different.

18.8  Consider  passages  (18.133a-b),  adapted  from  Winograd  (1972b). 

(18.133)  The  city  council  denied  the  demonstrators  a permit  because 

a.  they  feared  violence. 

b.  they  advocated  violence. 

What are the correct interpretations for the pronouns in each case? Sketch
out an analysis of each in the interpretation as abduction framework, in
which these reference assignments are made as a by-product of establishing
the Explanation relation.

18.9  Coherence  relations  may  also  apply  temporal  constraints  to  the  events 
or  states  denoted  by  sentences  in  a discourse.  These  constraints  must  be  com- 
patible with  the  temporal  information  indicated  by  the  tenses  used.  Consider 
the  two  follow-on  sentences  in  example  (18.134). 

(18.134) John got in a car accident.

a. He drank a six-pack of beer.

b. He had drunk a six-pack of beer.

In  what  order  do  the  events  occur  in  each  case?  What  coherence  relation  is 
operative  in  each  case?  Discuss  what  might  account  for  this  difference  given 
the  fact  that  causes  precede  effects. 

18.10 The coherence relations Result and Explanation are highly related,
in that Explanation is essentially the same as Result except with the opposite
ordering of clauses. These two relations are exemplified in examples
(18.135) and (18.136).

(18.135)  Bill  was  drunk.  John  hid  his  car  keys. 

(18.136)  John  hid  Bill’s  car  keys.  He  was  drunk. 

Now  consider  the  following  examples: 

(18.137)  Bill  was  drunk,  but  John  didn't  hide  his  car  keys. 

(18.138)  John  hid  Bill’s  car  keys,  even  though  he  wasn’t  drunk. 

The  coherence  relations  underlying  examples  (18.137)  and  (18.138)  have 
been  called  Violated  Expectation  and  Denial  of  Preventer  respectively. 

Define  the  constraints  that  these  two  relations  impose,  using  those  for 
Result  and  Explanation  as  a guide.  Discuss  how  we  might  consider  all  four 
relations  to  be  parameterized  versions  of  a single  relation. 

18.11  Select  an  editorial  column  from  your  favorite  newspaper,  and  deter- 
mine the  discourse  structure  for  a 10-20  sentence  portion.  What  problems 
did  you  encounter?  Were  you  helped  by  superficial  cues  the  speaker  included 
(e.g.,  discourse  connectives)  in  any  places? 


DIALOGUE AND CONVERSATIONAL AGENTS


C:  I want  you  to  tell  me  the  names  of  the  fellows  on  the  St. 
Louis  team. 

A:  I’m  telling  you.  Who’s  on  first,  What’s  on  second,  I Don’t 
Know  is  on  third. 

C:  You  know  the  fellows’  names? 

A:  Yes. 

C:  Well,  then,  who’s  playing  first? 

A:  Yes. 

C:  I mean  the  fellow’s  name  on  first. 

A:  Who. 

C:  The  guy  on  first  base. 

A:  Who  is  on  first. 

C:  Well  what  are  you  askin’  me  for? 

A:  I’m  not  asking  you  - I’m  telling  you.  Who  is  on  first. 

Who’s on First - Bud Abbott and Lou Costello’s version
of  an  old  burlesque  standard. 


The  literature  of  the  fantastic  abounds  in  inanimate  objects  magically 
endowed  with  sentience  and  the  gift  of  speech.  From  Ovid’s  statue  of  Pyg- 
malion to  Mary  Shelley’s  Frankenstein,  Cao  Xue  Qin’s  Divine  Lumines- 
cent Stone-in-Waiting  in  the  Court  of  Sunset  Glow  to  Snow  White’s  mirror, 
there  is  something  deeply  touching  about  creating  something  and  then  hav- 
ing a chat  with  it.  Legend  has  it  that  after  finishing  his  sculpture  of  Moses, 
Michelangelo  thought  it  so  lifelike  that  he  tapped  it  on  the  knee  and  com- 
manded it  to  speak.  Perhaps  this  shouldn’t  be  surprising.  Language  itself 
has always been the mark of humanity and sentience, and conversation or
dialogue is the most fundamental and specially privileged arena of language.
It is certainly the first kind of language we learn as children, and for most of
us, it is the kind of language we most commonly indulge in, whether we are
ordering  curry  for  lunch  or  buying  postage  stamps,  participating  in  business 
meetings  or  talking  with  our  families,  booking  airline  flights  or  complaining 
about  the  weather. 

This  chapter  introduces  the  fundamental  structures  and  algorithms  in 
conversational  agents,  programs  which  communicate  with  users  in  natural 
language in order to book airline flights, answer questions, or act as a telephone
interface to email. Many of these issues are also relevant for business
meeting  summarization  systems  and  other  spoken  language  understanding 
systems  which  must  transcribe  and  summarize  structured  conversations  like 
meetings.  Section  19.1  begins  by  introducing  some  issues  that  make  con- 
versation different  from  other  kinds  of  discourse,  introducing  the  important 
ideas  of  turn-taking,  grounding,  and  implicature.  Section  19.2  introduces 
the  speech  act  or  dialogue  act,  and  Section  19.3  gives  two  different  algo- 
rithms for  automatic  speech  act  interpretation.  Section  19.4  describes  how 
structure  and  coherence  in  dialogue  differ  from  the  discourse  structure  and 
coherence  we  saw  in  Chapter  18.  Finally,  Section  19.5  shows  how  each  of 
these  issues  must  be  addressed  in  choosing  an  architecture  for  a dialogue 
manager as part of a conversational agent.


19.1  What  Makes  Dialogue  Different? 


Much  about  dialogue  is  similar  to  other  kinds  of  discourse  like  the  text  mono- 
logues of  Chapter  18.  Dialogues  exhibit  anaphora  and  discourse  structure 
and  coherence,  although  with  some  slight  changes  from  monologue.  For  ex- 
ample when  resolving  an  anaphor  in  dialogue  it’s  important  to  look  at  what 
the  other  speaker  said.  In  the  following  fragment  from  the  air  travel  conver- 
sation in  Figure  19.1  (to  be  discussed  below),  realizing  that  the  pronoun  they 
refers  to  non-stop  flights  in  C’s  utterance  requires  looking  at  A’s  previous 
utterance. 


A4:  Right.  There’s  three  non-stops  today. 
C5: What are they?


Dialogue  does  differ  from  written  monologue  in  deeper  ways,  however. 
The next few subsections highlight some of these differences.

Turns and Utterances

One difference between monologue and dialogue is that dialogue is characterized
by turn-taking. Speaker A says something, then speaker B, then
speaker  A,  and  so  on.  Figure  19.1  shows  a sample  dialogue  broken  up  into 
labeled  turns;  we’ve  chosen  this  human-human  dialogue  because  it  concerns 
travel  planning,  a domain  that  is  the  focus  of  much  recent  human-machine 
dialogue  research. 


C1: ... I need to travel in May.
A1: And, what day in May did you want to travel?
C2: OK uh I need to be there for a meeting that’s from the 12th to the 15th.
A2: And you’re flying into what city?
C3: Seattle.
A3: And what time would you like to leave Pittsburgh?
C4: Uh hmm I don’t think there’s many options for non-stop.
A4: Right. There’s three non-stops today.
C5: What are they?
A5: The first one departs PGH at 10:00am arrives Seattle at 12:05 their
    time. The second flight departs PGH at 5:55pm, arrives Seattle at
    8pm. And the last flight departs PGH at 8:15pm arrives Seattle at
    10:28pm.
C6: OK I’ll take the 5ish flight on the night before on the 11th.
A6: On the 11th? OK. Departing at 5:55pm arrives Seattle at 8pm, US
    Air flight 115.
C7: OK.

Figure 19.1 A fragment from a telephone conversation between a speech
recognition researcher client (C) and a travel agent (A).

How  do  speakers  know  when  is  the  proper  time  to  contribute  their  turn? 
Consider  the  timing  of  the  utterances  in  conversations  like  Figure  19.1.  First, 
notice  that  this  dialogue  has  no  noticeable  overlap.  That  is,  the  beginning  of 
each speaker’s turn follows the end of the previous speaker’s turn (overlap
would  have  been  indicated  by  surrounding  it  with  the  # symbol).  The  actual 
amount  of  overlapped  speech  in  American  English  conversation  seems  to  be 
quite  small;  Levinson  (1983)  suggests  the  amount  is  less  than  5%  in  gen- 
eral, and  probably  less  for  certain  kinds  of  dialogue  like  the  task-oriented 
dialogue  in  Figure  19.1.  If  speakers  aren’t  overlapping,  perhaps  they  are 
waiting a while after the other speaker? This is also very rare. The amount
of time between turns is quite small, generally less than a few hundred milliseconds,
even in multi-party discourse. In fact, it may take more than this
few  hundred  milliseconds  for  the  next  speaker  to  plan  the  motor  routines  for 
producing  their  utterance,  which  means  that  speakers  begin  motor  planning 
for  their  next  utterance  before  the  previous  speaker  has  finished.  For  this  to 
be  possible,  natural  conversation  must  be  set  up  in  such  a way  that  (most 
of the time) people can quickly figure out who should talk next, and exactly
when they should talk. This kind of turn-taking behavior is generally
studied in the field of Conversation Analysis (CA). In a key conversation-analytic
paper, Sacks et al. (1974) argued that turn-taking behavior, at least
in American English, is governed by a set of turn-taking rules. These rules
apply at a transition-relevance place, or TRP; places where the structure
of the language allows speaker shift to occur. Here is a simplified version of
the  turn-taking  rules,  grouped  into  a single  three-part  rule;  see  Sacks  et  al. 
(1974)  for  the  complete  rules: 

(19.1) Turn-taking Rule. At each TRP of each turn:

a.  If  during  this  turn  the  current  speaker  has  selected  A as  the  next 
speaker  then  A must  speak  next. 

b.  If  the  current  speaker  does  not  select  the  next  speaker,  any  other 
speaker  may  take  the  next  turn. 

c.  If  no  one  else  takes  the  next  turn,  the  current  speaker  may  take 
the  next  turn. 
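
The three-part rule can be read as a small decision procedure applied at each TRP. The following sketch is only an illustration under simplifying assumptions: the function name `next_speaker`, its arguments, and the tie-breaking choice (the first volunteer wins, and the current speaker always continues if nobody else starts) are hypothetical and are not part of the rule as stated by Sacks et al. (1974), which says only who *may* speak.

```python
def next_speaker(current, selected=None, volunteers=()):
    """Pick the next speaker at a transition-relevance place.

    current    -- the speaker holding the turn
    selected   -- speaker chosen by the current speaker, or None
    volunteers -- speakers who try to self-select, in order of starting
    """
    # (a) If the current speaker selected someone, that person speaks next.
    if selected is not None:
        return selected
    # (b) Otherwise any other speaker may take the turn
    #     (assumption: the first one to start gets it).
    others = [s for s in volunteers if s != current]
    if others:
        return others[0]
    # (c) If no one else takes the turn, the current speaker may continue
    #     (assumption: here the current speaker always does).
    return current


# A question from A addressed to C selects C as next speaker.
after_question = next_speaker("A", selected="C")
# No selection and nobody self-selects: A keeps the floor.
after_lapse = next_speaker("A")
```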

There are a number of important implications of rule (19.1) for dialogue
modeling. First, subrule (19.1a) implies that there are some utterances
by which the speaker specifically selects who the next speaker will
be. The most obvious of these are questions, in which the speaker selects
another speaker to answer the question. Two-part structures like QUESTION-ANSWER
are called adjacency pairs (Schegloff, 1968); other adjacency
pairs include GREETING followed by GREETING, COMPLIMENT followed
by DOWNPLAYER, REQUEST followed by GRANT. We will see that these
pairs  and  the  dialogue  expectations  they  set  up  will  play  an  important  role  in 
dialogue  modeling. 

Subrule (19.1a) also has an implication for the interpretation of silence.
While silence can occur after any turn, silence which follows the first part of
an adjacency pair is significant silence. For example, Levinson (1983)
notes the following example from Atkinson and Drew (1979); pause lengths
are marked in parentheses (in seconds):

(19.2) A: Is there something bothering you or not?
          (1.0)
       A: Yes or no?
          (1.5)
       A: Eh?
       B: No.

Since  A has  just  asked  B a question,  the  silence  is  interpreted  as  a 
refusal to respond, or perhaps a dispreferred response (a response, like saying
‘no’ to a request, which is stigmatized). By contrast, silence in other
places, for example a lapse after a speaker finishes a turn, is not generally
interpretable in this way. These facts are relevant for user interface design
in spoken dialogue systems; users are disturbed by the pauses in dialogue
systems caused by slow speech recognizers (Yankelovich et al., 1995).
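
The difference between significant silence and an ordinary lapse can be sketched as a lookup on the act that preceded the pause. This is a toy illustration, not a model from the literature: the table repeats the adjacency pairs named above, while the function name, the returned labels, and the one-second threshold are hypothetical choices made for the example.

```python
# First pair part -> expected second pair part (pairs named in the text).
ADJACENCY_PAIRS = {
    "QUESTION": "ANSWER",
    "GREETING": "GREETING",
    "COMPLIMENT": "DOWNPLAYER",
    "REQUEST": "GRANT",
}


def interpret_silence(previous_act, pause_seconds, threshold=1.0):
    """Label a pause according to what preceded it.

    threshold is an assumed cut-off in seconds; the text only says that
    gaps between turns are normally a few hundred milliseconds or less.
    """
    if pause_seconds < threshold:
        return "normal gap"
    if previous_act in ADJACENCY_PAIRS:
        expected = ADJACENCY_PAIRS[previous_act]
        # Silence where a second pair part is due is meaningful.
        return "significant silence (missing " + expected + ")"
    # Silence after a completed turn with nothing pending is just a lapse.
    return "lapse"


# The first pause in example (19.2) follows a question.
label = interpret_silence("QUESTION", 1.0)
```

A dialogue system could use the same check in reverse: if its own recognizer is slow after the user asks a question, the user is likely to read the pause as a refusal or a problem.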

Another implication of (19.1) is that transitions between speakers don’t
occur just anywhere; the transition-relevance places where they tend to occur
are generally at utterance boundaries. This brings us to the next difference
between spoken dialogue and textual monologue (of course dialogue
can be written and monologue spoken; but most current applications of dialogue
involve speech): the spoken utterance versus the written sentence.
Recall from Chapter 9 that utterances differ from written sentences in a number
of ways. They tend to be shorter, are more likely to be single clauses, the
subjects are usually pronouns rather than full lexical noun phrases, and they
include filled pauses, repairs, and restarts.

One very important difference not discussed in Chapter 9 is that while
written sentences and paragraphs are relatively easy to automatically segment
from each other, utterances and turns are quite complex to segment.
Utterance boundary detection is important since many computational dialogue
models are based on extracting an utterance as a primitive unit. The
segmentation problem is difficult because a single utterance may be spread
over several turns, or a single turn may include several utterances. For example,
in the following fragment of a dialogue between a travel agent and a
client, the agent’s utterance stretches over three turns:


(19.3) A: Yeah yeah the um let me see here we’ve got you on American
          flight nine thirty eight
       C: Yep.
       A: leaving on the twentieth of June out of Orange County John
          Wayne Airport at seven thirty p.m.
       C: Seven thirty.
       A: and into uh San Francisco at eight fifty seven.

By contrast, the example below has three utterances in one turn:

(19.4) A: Three two three and seven five one. OK and then does he
know there is a nonstop that goes from Dulles to San Francisco?
Instead of connection through St. Louis.

Algorithms for utterance segmentation are based on many boundary
cues such as:

• cue words: Cue (or ‘clue’) words like well, and, so, etc., tend to occur
at the beginnings and ends of utterances (Reichman, 1985; Hirschberg
and Litman, 1993).

• N-gram word sequences: Specific word sequences often indicate boundaries.
N-gram grammars can be trained on a training set labeled with
special utterance-boundary tags, and then HMM decoding techniques
can be used to find the most likely utterance boundaries in an unlabeled
test set (Mast et al., 1996; Meteer and Iyer, 1996; Stolcke and Shriberg,
1996a).

• prosody: Prosodic features like pitch, accent, phrase-final lengthening
and pause duration play a role in utterance/turn segmentation, as discussed
in Chapter 4, although the relationship between utterances and
prosodic units like the intonation unit (Du Bois et al., 1983) or intonational
phrase (Beckman and Pierrehumbert, 1986) is complicated
(Ladd, 1996; Ford and Thompson, 1996; Ford et al., 1996, inter alia).
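
As a rough illustration of the N-gram cue, the sketch below trains a bigram model on token streams that contain an utterance-boundary tag and then decides, for each gap between two words of an untagged stream, whether inserting the tag makes the stream more probable. It is a simplified stand-in rather than any of the cited systems: with a bigram model each gap can be decided independently, so no full HMM decoding is needed, whereas higher-order models would require Viterbi-style search. The tag name `<U>`, the add-k smoothing constant, the function names, and the toy training data are all assumptions made for the example.

```python
from collections import Counter

BOUNDARY = "<U>"  # hypothetical utterance-boundary tag


def train_bigrams(tagged_streams):
    """Collect bigram and context counts from streams containing BOUNDARY."""
    context, bigrams, vocab = Counter(), Counter(), set()
    for stream in tagged_streams:
        tokens = [BOUNDARY] + list(stream)  # each stream starts an utterance
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            context[prev] += 1
            bigrams[(prev, cur)] += 1
    return context, bigrams, vocab


def prob(prev, cur, model, k=0.1):
    """Add-k smoothed bigram probability P(cur | prev)."""
    context, bigrams, vocab = model
    return (bigrams[(prev, cur)] + k) / (context[prev] + k * len(vocab))


def segment(words, model):
    """Insert BOUNDARY wherever that makes the token stream more likely."""
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i + 1 < len(words):
            nxt = words[i + 1]
            without = prob(word, nxt, model)
            with_tag = prob(word, BOUNDARY, model) * prob(BOUNDARY, nxt, model)
            if with_tag > without:
                out.append(BOUNDARY)
    return out


training = [
    ["yes", BOUNDARY, "well", "i", "need", "a", "flight", BOUNDARY],
    ["ok", BOUNDARY, "well", "i", "need", "a", "ticket", BOUNDARY],
    ["yes", BOUNDARY, "i", "need", "a", "flight", BOUNDARY],
]
model = train_bigrams(training)
result = segment(["ok", "well", "i", "need", "a", "flight"], model)
# result == ['ok', '<U>', 'well', 'i', 'need', 'a', 'flight']
```

In this toy data the only boundary found is after "ok", because "ok" was followed by the tag in training and "well" often starts an utterance, which is also the kind of evidence the cue-word approach relies on.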

The relationship between turns and utterances seems to be more one-to-one
in human-machine dialogue than in the human-human dialogues discussed
above. Probably this is because the simplicity of current systems
causes people to use simpler utterances and turns. Thus while computational
tasks like meeting summarization require solving quite difficult segmentation
problems, segmentation may be easier for conversational agents.

Grounding

Another important characteristic of dialogue that distinguishes it from monologue
is that it is a collective act performed by the speaker and the hearer.
One implication of this collectiveness is that, unlike in monologue, the speaker
and hearer must constantly establish common ground (Stalnaker, 1978), the
set of things that are mutually believed by both speakers. The need to achieve
common ground means that the hearer must ground or acknowledge the
speaker’s utterances, or else make it clear that there was a problem in
reaching common ground. For example, consider the role of the word mm-hmm
in the following fragment of a conversation between a travel agent and
a client:

A:  ...  returning  on  US  flight  one  one  one  eight. 

C: Mm-hmm

The word mm-hmm here is a continuer, also often called a backchannel
or an acknowledgement token. A continuer is a short utterance which
acknowledges the previous utterance in some way, often cueing the other
speaker to continue talking (Jefferson, 1984; Schegloff, 1982; Yngve, 1970).
By letting the speaker know that the utterance has ‘reached’ the addressee,
a continuer/backchannel thus helps the speaker and hearer achieve common
ground. Continuers are just one of the ways that the hearer can indicate
that she believes she understands what the speaker meant. Clark and Schaefer
(1989) discuss five main types of methods, ordered from weakest to
strongest:

1. Continued attention: B shows she is continuing to attend and therefore
remains satisfied with A’s presentation.

2. Relevant next contribution: B starts in on the next relevant contribution.

3. Acknowledgement: B nods or says a continuer like uh-huh, yeah, or
the like, or an assessment like that’s great.

4. Demonstration: B demonstrates all or part of what she has understood
A to mean, for example by paraphrasing or reformulating A’s
utterance, or by collaboratively completing A’s utterance.

5. Display: B displays verbatim all or part of A’s presentation.

The following excerpt from our sample conversation shows a display of
understanding by A’s repetition of on the 11th:

C6:  OK I'll take the 5ish flight on the night before on the 11th.

A6:  On the 11th?

Such repeats or reformulations are often done in the form of questions
like A6; we return to this issue on page 735.

Not all of Clark and Schaefer’s methods are available for telephone-based
conversational agents. Without eye-gaze as a visual indicator of attention,
for example, continued attention isn’t an option. In fact Stifelman et al.
(1993) and Yankelovich et al. (1995) point out that users of speech-based
interfaces are often confused when the system doesn’t give them an explicit
acknowledgement signal after processing the user’s utterances.

In addition to these acknowledgement acts, a hearer can indicate that there
were problems in understanding the previous utterance, for example by
issuing a request for repair like the following Switchboard example:

A:  Why  is  that? 

B:  Huh? 

A:  Why  is  that? 

Conversational  Implicature 

The  final  important  property  of  conversation  is  the  way  the  interpretation  of 
an  utterance  relies  on  more  than  just  the  literal  meaning  of  the  sentences. 
Consider  the  client’s  response  C2  from  the  sample  conversation  above,  re- 
peated here: 

A1:  And, what day in May did you want to travel?

C2:  OK  uh  I need  to  be  there  for  a meeting  that's  from  the  12th  to  the  15th. 

Notice  that  the  client  does  not  in  fact  answer  the  question.  The  client 
merely  states  that  he  has  a meeting  at  a certain  time.  The  semantics  for  this 
sentence  produced  by  a semantic  interpreter  will  simply  mention  this  meet- 
ing. What  is  it  that  licenses  the  agent  to  infer  that  the  client  is  mentioning 
this  meeting  so  as  to  inform  the  agent  of  the  travel  dates? 

Now  consider  another  utterance  from  the  sample  conversation,  this  one 
by  the  agent: 

A4:  ... There’s three non-stops today.

Now this statement would still be true if there were seven non-stops today,
since if there are seven of something, there are by definition also three.
But what the agent means here is that there are three and not more than
three non-stops today. How is the client to infer that the agent means only
three non-stops?

These two cases have something in common; in both cases the speaker seems
to expect the hearer to draw certain inferences; in other words, the
speaker is communicating more information than seems to be present in the
uttered words. These kinds of examples were pointed out by Grice (1975,
1978) as part of his theory of conversational implicature. Implicature
means a particular class of licensed inferences. Grice proposed that what
enables hearers to draw these inferences is that conversation is guided by
a set of maxims, general heuristics which play a guiding role in the
interpretation of conversational utterances. He proposed the following four
maxims:

• Maxim of Quantity: Be exactly as informative as is required:

1.  Make  your  contribution  as  informative  as  is  required  (for  the  cur- 
rent purposes  of  the  exchange). 

2.  Do  not  make  your  contribution  more  informative  than  is  required. 

• Maxim  of  Quality:  Try  to  make  your  contribution  one  that  is  true: 

1.  Do  not  say  what  you  believe  to  be  false. 

2.  Do  not  say  that  for  which  you  lack  adequate  evidence. 

• Maxim  of  Relevance:  Be  relevant. 

• Maxim  of  Manner:  Be  perspicuous: 

1.  Avoid  obscurity  of  expression. 

2.  Avoid  ambiguity. 

3.  Be  brief  (avoid  unnecessary  prolixity). 

4.  Be  orderly. 

It  is  the  Maxim  of  Quantity  (specifically  Quantity  1)  that  allows  the 
hearer  to  know  that  three  non-stops  didn’t  mean  seven  non-stops.  This  is 
because  the  hearer  assumes  the  speaker  is  following  the  maxims,  and  thus 
if  the  speaker  meant  seven  non-stops  she  would  have  said  seven  non-stops 
(‘as  informative  as  is  required’).  The  Maxim  of  Relevance  is  what  allows  the 
agent  to  know  that  the  client  wants  to  travel  by  the  12th.  The  agent  assumes 
the  client  is  following  the  maxims,  and  hence  would  only  have  mentioned 
the  meeting  if  it  was  relevant  at  this  point  in  the  dialogue.  The  most  natural 
inference  that  would  make  the  meeting  relevant  is  the  inference  that  the  client 
meant  the  agent  to  understand  that  his  departure  time  was  before  the  meeting 
time. 

These  three  properties  of  conversation  (turn-taking,  grounding,  and 
implicature)  will  play  an  important  role  in  the  discussion  of  dialogue  acts, 
dialogue  structure,  and  dialogue  managers  in  the  next  sections. 

19.2  Dialogue  Acts 

An  important  insight  about  conversation,  due  to  Austin  (1962),  is  that  an 
utterance  in  a dialogue  is  a kind  of  action  being  performed  by  the  speaker. 
This  is  particularly  clear  in  performative  sentences  like  the  following: 

(19.5)  I name  this  ship  the  Titanic. 

(19.6)  I second  that  motion. 
(19.7)  I bet you five dollars it will snow tomorrow.

When uttered by the proper authority, for example, (19.5) has the effect of
changing the state of the world (causing the ship to have the name Titanic)
just as any action can change the state of the world. Verbs like name or
second which perform this kind of action are called performative verbs, and
Austin called these kinds of actions speech acts. What makes Austin’s work
so far-reaching is that speech acts are not confined to this small class of
performative verbs. Austin’s claim is that the utterance of any sentence in
a real speech situation constitutes three kinds of acts:

• locutionary  act:  the  utterance  of  a sentence  with  a particular  meaning 

• illocutionary  act:  the  act  of  asking,  answering,  promising,  etc.,  in 
uttering  a sentence. 

• perlocutionary  act:  the  (often  intentional)  production  of  certain  ef- 
fects upon  the  feelings,  thoughts,  or  actions  of  the  addressee  in  uttering 
a sentence. 

For example, Austin explains that the utterance of (19.8) might have the
illocutionary force of protesting and the perlocutionary effect of stopping
the addressee from doing something, or annoying the addressee.

(19.8)  You  can’t  do  that. 

The  term  speech  act  is  generally  used  to  describe  illocutionary  acts 
rather  than  either  of  the  other  two  levels.  Searle  (1975b),  in  modifying  a 
taxonomy  of  Austin’s,  suggests  that  all  speech  acts  can  be  classified  into  one 
of  5 major  classes: 

• Assertives: committing the speaker to something’s being the case
(suggesting, putting forward, swearing, boasting, concluding).

• Directives: attempts by the speaker to get the addressee to do something
(asking, ordering, requesting, inviting, advising, begging).

• Commissives: committing the speaker to some future course of action
(promising, planning, vowing, betting, opposing).

• Expressives: expressing the psychological state of the speaker about a
state of affairs (thanking, apologizing, welcoming, deploring).

• Declarations: bringing about a different state of the world via the
utterance (including many of the performative examples above; I resign,
You’re fired).

While speech acts provide a useful characterization of one kind of
pragmatic force, more recent work, especially in building dialogue systems,
has significantly expanded this core notion, modeling more kinds of
conversational functions that an utterance can play. The resulting enriched
acts are called dialogue acts (?) or conversational moves (Power, 1979;
Carletta et al., 1997). A recent ongoing effort to develop a dialogue act
tagging scheme is the DAMSL (Dialogue Act Markup in Several Layers)
architecture (Allen and Core, 1997; Walker et al., 1996; Carletta et al.,
1997; Core et al., 1999), which codes various levels of dialogue
information about utterances. Two of these levels, the forward looking
function and the backward looking function, are extensions of speech acts
which draw on notions of dialogue structure like the adjacency pairs
mentioned earlier as well as notions of grounding and repair. For example,
the forward looking function of an utterance corresponds to something like
the Searle/Austin speech act, although the DAMSL tag set is hierarchical,
and is focused somewhat on the kind of dialogue acts that tend to occur in
task-oriented dialogue:


STATEMENT                 a claim made by the speaker
INFO-REQUEST              a question by the speaker
  CHECK                   a question for confirming information (see below)
INFLUENCE-ON-ADDRESSEE    (=Searle’s directives)
  OPEN-OPTION             a weak suggestion or listing of options
  ACTION-DIRECTIVE        an actual command
INFLUENCE-ON-SPEAKER      (=Austin’s commissives)
  OFFER                   speaker offers to do something
                          (subject to confirmation)
  COMMIT                  speaker is committed to doing something
CONVENTIONAL              other
  OPENING                 greetings
  CLOSING                 farewells
  THANKING                thanking and responding to thanks

The  backward  looking  function  of  DAMSL  focuses  on  the  relationship 
of  an  utterance  to  previous  utterances  by  the  other  speaker.  These  include 
accepting  and  rejecting  proposals  (since  DAMSL  is  focused  on  task-oriented 
dialogue),  as  well  as  grounding  and  repair  acts  discussed  above. 


AGREEMENT               speaker’s response to previous proposal
  ACCEPT                accepting the proposal
  ACCEPT-PART           accepting some part of the proposal
  MAYBE                 neither accepting nor rejecting the proposal
  REJECT-PART           rejecting some part of the proposal
  REJECT                rejecting the proposal
  HOLD                  putting off response, usually via subdialogue
ANSWER                  answering a question
UNDERSTANDING           whether speaker understood previous
  SIGNAL-NON-UNDER.     speaker didn't understand (usually = NTRI)
  SIGNAL-UNDER.         speaker did understand
    ACK                 demonstrated via continuer or assessment
    REPEAT-REPHRASE     demonstrated via repetition or reformulation
    COMPLETION          demonstrated via collaborative completion


Figure  19.2  shows  a labeling  of  our  sample  conversation  using  versions 
of  the  DAMSL  Forward  and  Backward  tags. 
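
A labeled utterance of this kind can be stored as a small record. The
following Python sketch is only an illustration: the class name
LabeledUtterance, its field names, and the decision to flatten each tag
hierarchy into a plain set are assumptions made here, not part of DAMSL.
The tag names themselves are copied from the two tag lists above.

```python
from dataclasses import dataclass

# Tag names copied from the forward and backward tag lists; flattening the
# hierarchy into two plain sets is a simplification made for this sketch.
FORWARD_TAGS = frozenset({
    "STATEMENT", "INFO-REQUEST", "CHECK", "INFLUENCE-ON-ADDRESSEE",
    "OPEN-OPTION", "ACTION-DIRECTIVE", "INFLUENCE-ON-SPEAKER", "OFFER",
    "COMMIT", "CONVENTIONAL", "OPENING", "CLOSING", "THANKING",
})
BACKWARD_TAGS = frozenset({
    "AGREEMENT", "ACCEPT", "ACCEPT-PART", "MAYBE", "REJECT-PART", "REJECT",
    "HOLD", "ANSWER", "UNDERSTANDING", "SIGNAL-NON-UNDER.", "SIGNAL-UNDER.",
    "ACK", "REPEAT-REPHRASE", "COMPLETION",
})


@dataclass(frozen=True)
class LabeledUtterance:
    """One utterance with its forward and backward looking tags (hypothetical)."""
    speaker: str
    text: str
    forward: frozenset = frozenset()
    backward: frozenset = frozenset()

    def __post_init__(self):
        # Reject any tag that does not appear in the corresponding tag list.
        unknown = (self.forward - FORWARD_TAGS) | (self.backward - BACKWARD_TAGS)
        if unknown:
            raise ValueError(f"unknown DAMSL tags: {sorted(unknown)}")


# Utterance A6 of the sample conversation: a CHECK that also acknowledges.
a6 = LabeledUtterance("A", "On the 11th?",
                      forward=frozenset({"CHECK"}),
                      backward=frozenset({"ACK"}))
```

Keeping the two functions in separate fields reflects the fact that a single
utterance can carry a tag at each level at once.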


19.3  Automatic  Interpretation  of  Dialogue  Acts 


The  previous  section  introduced  dialogue  acts  and  other  activities  that  ut- 
terances can  perform.  This  section  turns  to  the  problem  of  identifying  or 
interpreting  these  acts.  That  is,  how  do  we  decide  whether  a given  input  is  a 
QUESTION,  a STATEMENT,  a SUGGEST  (directive),  or  an  ACKNOWL- 
EDGEMENT? 

At  first  glance,  this  problem  looks  simple.  We  saw  in  Chapter  9 that 
yes-no-questions  in  English  have  aux-inversion,  statements  have  declarative 
syntax  (no  aux-inversion),  and  commands  have  imperative  syntax  (sentences 
with  no  syntactic  subject),  as  in  (19.9): 

(19.9)  YES-NO-QUESTION  Will breakfast be served on USAir 1557?
        STATEMENT        I don’t care about lunch
        COMMAND          Show me flights from Milwaukee to Orlando on
                         Thursday night.

It seems from (19.9) that the surface syntax of the input ought to tell us
what illocutionary act it is. Alas, as is clear from Abbott and Costello’s
famous Who’s on First routine at the beginning of the chapter, things are
not so simple. The mapping between surface form and illocutionary act is
not obvious or even one-to-one.

[assert]              C1:  ... I need to travel in May.
[info-req,ack]        A1:  And, what day in May did you want to travel?
[assert,answer]       C2:  OK uh I need to be there for a meeting that’s
                           from the 12th to the 15th.
[info-req,ack]        A2:  And you’re flying into what city?
[assert,answer]       C3:  Seattle.
[info-req,ack]        A3:  And what time would you like to leave Pittsburgh?
[check,hold]          C4:  Uh hmm I don’t think there’s many options for
                           non-stop.
[accept,ack]          A4:  Right.
[assert]                   There’s three non-stops today.
[info-req]            C5:  What are they?
[assert,open-option]  A5:  The first one departs PGH at 10:00am arrives
                           Seattle at 12:05 their time. The second flight
                           departs PGH at 5:55pm, arrives Seattle at 8pm.
                           And the last flight departs PGH at 8:15pm
                           arrives Seattle at 10:28pm.
[accept,ack]          C6:  OK I'll take the 5ish flight on the night before
                           on the 11th.
[check,ack]           A6:  On the 11th?
[assert,ack]               OK. Departing at 5:55pm arrives Seattle at 8pm,
                           US Air flight 115.
[ack]                 C7:  OK.

Figure 19.2  A potential DAMSL labeling of the conversation fragment in
Figure 19.1.

For example, the following utterance spoken to an ATIS system looks like a
YES-NO-QUESTION meaning something like Are you capable of giving me a list
of...?:

(19.10)  Can  you  give  me  a list  of  the  flights  from  Atlanta  to  Boston? 

In fact, however, this person was not interested in whether the system was
capable of giving a list; this utterance was actually a polite form of a
DIRECTIVE or a REQUEST, meaning something more like Please give me a list
of.... Thus what looks on the surface like a QUESTION can really be a
REQUEST.

Similarly, what looks on the surface like a STATEMENT can really be a
QUESTION. A very common kind of question, called a CHECK question (Carletta
et al., 1997; Labov and Fanshel, 1977), is used to ask the other
participant to confirm something that this other participant has privileged
knowledge about. These CHECKs are questions, but they have declarative
surface form, as in the utterance labeled CHECK in the following snippet
from another travel agent conversation:

A  OPEN-OPTION  I was wanting to make some arrangements for a trip that
                I’m going to be taking uh to LA uh beginning of the week
                after next.
B  HOLD         OK uh let me pull up your profile and I'll be right with
                you here. [pause]
B  CHECK        And you said you wanted to travel next week?
A  ACCEPT       Uh yes.


Utterances which use a surface statement to ask a question, or a surface
question to issue a request, are called indirect speech acts.

How can a surface yes-no-question like Can you give me a list of the
flights from Atlanta to Boston? be mapped into the correct illocutionary
act REQUEST? Solutions to this problem lie along a continuum of
idiomaticity. At one end of the continuum is the idiom approach, which
assumes that a sentence structure like Can you give me a list? or Can you
pass the salt? is ambiguous between a literal meaning as a YES-NO-QUESTION
and an idiomatic meaning as a request. The grammar of English would simply
list REQUEST as one meaning of Can you X. One problem with this approach is
that there are many ways to make an indirect request, each of which has
slightly different surface grammatical structure (see below). The grammar
would have to store the REQUEST meaning in many different places.
Furthermore, the idiom approach doesn’t make use of the fact that there are
semantic generalizations about what makes something a legitimate indirect
request.

The  alternative  end  of  the  continuum  is  the  inferential  approach,  first 
proposed  by  Gordon  and  Lakoff  (1971)  and  taken  up  by  Searle  (1975a). 
Their  intuition  was  that  a sentence  like  Can  you  give  me  a list  of  flights  from 
Atlanta?  is  unambiguous,  meaning  only  Do  you  have  the  ability  to  give  me 
a list  of  flights  from  Atlanta?.  The  directive  speech  act  Please  give  me  a list 
of  flights  from  Atlanta  is  inferred  by  the  hearer. 

The  next  two  sections  will  introduce  two  models  of  dialogue  act  in- 
terpretation: an  inferential  model  called  the  plan  inference  model,  and  an 
idiom-based  model  called  the  cue  model. 

Plan-Inferential Interpretation of Dialogue Acts

The  plan-inference  approach  to  dialogue  act  interpretation  was  first  proposed 
by  Gordon  and  Lakoff  (1971)  and  Searle  (1975a)  when  they  noticed  that 
there  was  a structure  to  what  kind  of  things  a speaker  could  do  to  make  an 
indirect  request.  In  particular,  they  noticed  that  a speaker  could  mention  or 
question  various  quite  specific  properties  of  the  desired  activity  to  make  an 
indirect  request;  here  is  a partial  list  with  examples  from  the  ATIS  corpus: 

1.  The  speaker  can  question  the  hearer’s  ability  to  perform  the  activity 

• Can  you  give  me  a list  of  the  flights  from  Atlanta  to  Boston? 

• Could  you  tell  me  if  Delta  has  a hub  in  Boston? 

• Would  you  be  able  to,  uh,  put  me  on  a flight  with  Delta? 

2.  The  speaker  can  mention  speaker’s  wish  or  desire  about  the  activity 

• I want  to  fly  from  Boston  to  San  Francisco. 

• I would  like  to  stop  somewhere  else  in  between. 

• I’m  looking  for  one  way  flights  from  Tampa  to  Saint  Louis. 

• I need  that  for  Tuesday. 

• I wonder  if  there  are  any  flights  from  Boston  to  Dallas. 

3.  The  speaker  can  mention  the  hearer’s  doing  the  action 

• Would  you  please  repeat  that  information? 

• Will  you  tell  me  the  departure  time  and  arrival  time  on  this  Amer- 
ican flight? 

4.  The  speaker  can  question  the  speaker’s  having  permission  to  receive 
results  of  the  action 

• May I get a lunch on flight UA two one instead of breakfast?

• Could  I have  a listing  of  flights  leaving  Boston? 

Based on this realization, Searle (1975a, p. 73) proposed that the hearer’s
chain of reasoning upon hearing Can you give me a list of the flights from
Atlanta to Boston? might be something like the following (modified for our
ATIS example):

1.  X has  asked  me  a question  about  whether  I have  the  ability  to  give  a 
list  of  flights. 

2.  I assume  that  X is  being  cooperative  in  the  conversation  (in  the  Gricean 
sense)  and  that  his  utterance  therefore  has  some  aim. 

3.  X knows  I have  the  ability  to  give  such  a list,  and  there  is  no  alternative 
reason  why  X should  have  a purely  theoretical  interest  in  my  list-giving 
ability. 

4.  Therefore X’s utterance probably has some ulterior illocutionary point.
What can it be?

5.  A preparatory  condition  for  a directive  is  that  the  hearer  have  the  ability 
to  perform  the  directed  action. 

6.  Therefore  X has  asked  me  a question  about  my  preparedness  for  the 
action  of  giving  X a list  of  flights. 

7.  Furthermore, X and I are in a conversational situation in which giving
lists of flights is a common and expected activity.

8.  Therefore,  in  the  absence  of  any  other  plausible  illocutionary  act,  X is 
probably  requesting  me  to  give  him  a list  of  flights. 

The inferential approach has a number of advantages. First, it explains why
Can you give me a list of flights from Boston? is a reasonable way of
making an indirect request and Boston is in New England is not: the former
mentions a precondition for the desired activity, and there is a reasonable
inferential chain from the precondition to the activity itself. The
inferential approach has been modeled by Allen, Cohen, and Perrault and
their colleagues in a number of influential papers on what have been called
BDI (belief, desire, and intention) models (Allen, 1995). The earliest
papers, such as Cohen and Perrault (1979), offered an AI planning model for
how speech acts are generated. One agent, seeking to find out some
information, could use standard planning techniques to come up with the
plan of asking the hearer to tell the speaker the information. Perrault and
Allen (1980) and Allen and Perrault (1980) also applied this BDI approach
to comprehension, specifically the comprehension of indirect speech
effects, essentially cashing out Searle’s (1975) promissory note in a
computational formalism.

We’ll begin by summarizing Perrault and Allen’s formal definitions of
belief and desire in the predicate calculus. We’ll represent “S believes
the proposition P” as the two-place predicate B(S,P). Reasoning about
belief is done with a number of axiom schemas inspired by Hintikka (1969b)
(such as B(A,P) ∧ B(A,Q) ⇒ B(A,P ∧ Q); see Perrault and Allen (1980) for
details). Knowledge is defined as ‘true belief’; S knows that P will be
represented as KNOW(S,P), defined as follows:

    KNOW(S,P) ≡ P ∧ B(S,P)

In addition to knowing that, we need to define knowing whether. S knows
whether (KNOWIF) a proposition P is true if S KNOWs that P or S KNOWs that
¬P:

    KNOWIF(S,P) ≡ KNOW(S,P) ∨ KNOW(S,¬P)
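
The two definitions can be tried out on a toy model. The sketch below is
not Perrault and Allen’s formalization; it assumes, purely for
illustration, that propositions are strings, that ¬P is the tuple
("not", P), and that the true facts and each agent’s beliefs are finite
sets. The names World, holds, believes, know, and knowif are hypothetical.

```python
from dataclasses import dataclass, field


def neg(p):
    """Return the negation of proposition p, cancelling a double negation."""
    if isinstance(p, tuple) and p[0] == "not":
        return p[1]
    return ("not", p)


@dataclass
class World:
    facts: set = field(default_factory=set)      # propositions that are true
    beliefs: dict = field(default_factory=dict)  # agent -> set of believed propositions

    def holds(self, p):
        # A negated proposition holds when its positive form is not a fact
        # (a closed-world assumption made only for this sketch).
        if isinstance(p, tuple) and p[0] == "not":
            return p[1] not in self.facts
        return p in self.facts

    def believes(self, s, p):
        """B(S,P): P is among the propositions S believes."""
        return p in self.beliefs.get(s, set())

    def know(self, s, p):
        """KNOW(S,P): P is true and S believes P."""
        return self.holds(p) and self.believes(s, p)

    def knowif(self, s, p):
        """KNOWIF(S,P): S knows that P or S knows that not-P."""
        return self.know(s, p) or self.know(s, neg(p))


w = World(
    facts={"has_seats(F)"},
    beliefs={"agent": {"has_seats(F)"},
             "client": {("not", "has_seats(F)")}},
)
```

Here the agent knows whether the flight has seats, while the client, who
holds a false belief, does not: a belief that is not true never counts as
knowledge under the ‘true belief’ definition.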

The theory of desire relies on the predicate WANT. If an agent S wants P to
be true, we say WANT(S,P), or W(S,P) for short. P can be a state or the
execution of some action. Thus if ACT is the name of an action,
W(S,ACT(H)) means that S wants H to do ACT. The logic of WANT relies on its
own set of axiom schemas just like the logic of belief.

The BDI models also require an axiomatization of actions and planning; the
simplest of these is based on a set of action schemas similar to the AI
planning model STRIPS (Fikes and Nilsson, 1971). Each action schema has a
set of parameters with constraints about the type of each variable, and
three parts:

• Preconditions:  Conditions  that  must  already  be  true  in  order  to  suc- 
cessfully perform  the  action. 

• Effects:  Conditions  that  become  true  as  a result  of  successfully  per- 
forming the  action. 

• Body:  A set  of  partially  ordered  goal  states  that  must  be  achieved  in 
performing  the  action. 

In the travel domain, for example, the action of agent A booking flight F
for client C might have the following simplified definition:

BOOK-FLIGHT(A,C,F):

  Constraints:   Agent(A) ∧ Flight(F) ∧ Client(C)

  Precondition:  Know(A,departure-date(F)) ∧ Know(A,departure-time(F)) ∧
                 Know(A,origin-city(F)) ∧ Know(A,destination-city(F)) ∧
                 Know(A,flight-type(F)) ∧ Has-Seats(F) ∧
                 W(C,(BOOK(A,C,F))) ∧ ...

  Effect:        Flight-Booked(A,C,F)

  Body:          Make-Reservation(A,F,C)
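
An action schema of this shape maps naturally onto a small data structure.
The following Python sketch is a simplified illustration rather than a
planner: it assumes the schema has already been instantiated, treats each
condition as an opaque string, and drops the type constraints. The class
name ActionSchema and its methods are hypothetical. The preconditions
elided by "..." in the definition above are not filled in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionSchema:
    """A ground STRIPS-style action: preconditions, effects, and body."""
    name: str
    preconditions: frozenset  # conditions that must already be true
    effects: frozenset        # conditions that become true afterwards
    body: tuple = ()          # goal states achieved in performing the action

    def applicable(self, state):
        """True when every precondition is present in the state."""
        return self.preconditions <= state

    def apply(self, state):
        """Return the successor state, or raise if a precondition is unmet."""
        missing = self.preconditions - state
        if missing:
            raise ValueError(f"unmet preconditions: {sorted(missing)}")
        return state | self.effects


# BOOK-FLIGHT(A,C,F), listing only the preconditions spelled out in the text.
book_flight = ActionSchema(
    name="BOOK-FLIGHT(A,C,F)",
    preconditions=frozenset({
        "Know(A,departure-date(F))",
        "Know(A,departure-time(F))",
        "Know(A,origin-city(F))",
        "Know(A,destination-city(F))",
        "Know(A,flight-type(F))",
        "Has-Seats(F)",
        "W(C,BOOK(A,C,F))",
    }),
    effects=frozenset({"Flight-Booked(A,C,F)"}),
    body=("Make-Reservation(A,F,C)",),
)
```

Calling book_flight.apply(state) on a state that lacks, say,
"Know(A,departure-date(F))" raises an error naming the missing condition,
which is exactly the situation in which an agent would need to ask the
client for the travel date.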

Cohen and Perrault (1979) and Perrault and Allen (1980) use this kind of
action specification for speech acts. For example here are Perrault and
Allen’s definitions for three speech acts relevant to indirect requests.
INFORM is the speech act of informing the hearer of some proposition
(Austin/Searle’s Assertive, or DAMSL’s STATEMENT). The definition of INFORM
is based on Grice’s (1957) idea that a speaker informs the hearer of
something merely by causing the hearer to believe that the speaker wants
them to know something:

INFORM(S,H,P):

  Constraints:   Speaker(S) ∧ Hearer(H) ∧ Proposition(P)

  Precondition:  Know(S,P) ∧ W(S,INFORM(S,H,P))

  Effect:        Know(H,P)

  Body:          B(H,W(S,Know(H,P)))

INFORMIF  is  the  act  used  to  inform  the  hearer  whether  a proposition 
is  true  or  not;  like  INFORM,  the  speaker  INFORMIFs  the  hearer  by  causing 
the  hearer  to  believe  the  speaker  wants  them  to  KNOWIF  something: 

INFORMIF(S,H,P):

  Constraints:   Speaker(S) ∧ Hearer(H) ∧ Proposition(P)

  Precondition:  KnowIf(S,P) ∧ W(S,INFORMIF(S,H,P))

  Effect:        KnowIf(H,P)

  Body:          B(H,W(S,KnowIf(H,P)))

REQUEST  is  the  directive  speech  act  for  requesting  the  hearer  to  per- 
form some  action: 

REQUEST(S,H,ACT):

  Constraints:   Speaker(S) ∧ Hearer(H) ∧ ACT(A) ∧ H is agent of ACT

  Precondition:  W(S,ACT(H))

  Effect:        W(H,ACT(H))

  Body:          B(H,W(S,ACT(H)))

Perrault and Allen’s theory also requires what are called ‘surface-level
acts’. These correspond to the ‘literal meanings’ of the imperative,
interrogative, and declarative structures. For example the ‘surface-level’
act S.REQUEST produces imperative utterances:

S.REQUEST(S,H,ACT):

  Effect:        B(H,W(S,ACT(H)))

The  effects  of  S.REQUEST  match  the  body  of  a regular  REQUEST, 
since  this  is  the  default  or  standard  way  of  doing  a request  (but  not  the  only 
way).  This  ‘default’  or  ‘literal’  meaning  is  the  start  of  the  hearer’s  inference 
chain.  The  hearer  will  be  given  an  input  which  indicates  that  the  speaker  is 
requesting  the  hearer  to  inform  the  speaker  whether  the  hearer  is  capable  of 
giving  the  speaker  a list: 

S.REQUEST(S,H,InformIf(H,S,CanDo(H,Give(H,S,LIST)))) 

The  hearer  must  figure  out  that  the  speaker  is  actually  making  a re- 
quest: 


REQUEST(H,S,Give(H,S,LIST)) 

The inference chain from the request-to-inform-if-cando to the
request-to-give is based on a chain of plausible inference, based on
heuristics called plan inference (PI) rules. We will use the following
subset of the rules that Perrault and Allen (1980) propose:

• (PI.AE) Action-Effect Rule: For all agents S and H, if Y is an effect of
action X and if H believes that S wants X to be done, then it is plausible
that H believes that S wants Y to obtain.

• (PI.PA) Precondition-Action Rule: For all agents S and H, if X is a
precondition of action Y and if H believes S wants X to obtain, then it is
plausible that H believes that S wants Y to be done.

• (PI.BA) Body-Action Rule: For all agents S and H, if X is part of the
body of Y and if H believes that S wants X done, then it is plausible that
H believes that S wants Y done.

• (PI.KP) Know-Desire Rule: For all agents S and H, if H believes S wants
to KNOWIF(P), then H believes S wants P to be true:

    B(H,W(S,KNOWIF(S,P)))  ⇒(plausible)  B(H,W(S,P))

• (EI.1) Extended Inference Rule: If

    B(H,W(S,X))  ⇒(plausible)  B(H,W(S,Y))

is a PI rule, then

    B(H,W(S,B(H,W(S,X))))  ⇒(plausible)  B(H,W(S,B(H,W(S,Y))))

is a PI rule (i.e., you can prefix B(H,W(S)) to any plan inference rule).

Let’s see how to use these rules to interpret the indirect speech act in
Can you give me a list of flights from Atlanta?. Step (0) in the table
below shows the speaker’s initial speech act, which the hearer initially
interprets literally as a question. Step (1) then uses Plan Inference rule
Action-Effect, which suggests that if the speaker asked for something (in
this case information), they probably want it. Step (2) again uses the
Action-Effect rule, here suggesting that if the speaker wants an INFORMIF,
and KNOWIF is an effect of INFORMIF, then the speaker probably also wants
KNOWIF.


Rule      Step  Result

          (0)   S.REQUEST(S,H,InformIf(H,S,CanDo(H,Give(H,S,LIST))))
PI.AE     (1)   B(H,W(S,InformIf(H,S,CanDo(H,Give(H,S,LIST)))))
PI.AE/EI  (2)   B(H,W(S,KnowIf(H,S,CanDo(H,Give(H,S,LIST)))))
PI.KP/EI  (3)   B(H,W(S,CanDo(H,Give(H,S,LIST))))
PI.PA/EI  (4)   B(H,W(S,Give(H,S,LIST)))
PI.BA     (5)   REQUEST(H,S,Give(H,S,LIST))

Step (3) adds the crucial inference that people don’t usually ask about
things  they  aren’t  interested  in;  thus  if  the  speaker  asks  whether  something  is 
true  (in  this  case  CanDo),  the  speaker  probably  wants  it  (CanDo)  to  be  true. 
Step  (4)  makes  use  of  the  fact  that  CanDo(ACT)  is  a precondition  for  (ACT), 
making  the  inference  that  if  the  speaker  wants  a precondition  (CanDo)  for 
an  action  (Give),  the  speaker  probably  also  wants  the  action  (Give).  Finally, 
step  (5)  relies  on  the  definition  of  REQUEST  to  suggest  that  if  the  speaker 
wants  someone  to  know  that  the  speaker  wants  them  to  do  something,  then 
the  speaker  is  probably  REQUESTing  them  to  do  it. 
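
The chain of steps (0) through (5) can be traced mechanically. The Python
sketch below is a deliberately small illustration, not Perrault and
Allen’s system: terms are nested tuples, each plan inference rule is
hard-wired to the one schema it needs, the rules are tried in a fixed
order, and the extended inference rule is not modeled. All function names
are hypothetical. The sketch writes KnowIf with two arguments and REQUEST
with the speaker first, following the definitions of KNOWIF(S,P) and
REQUEST(S,H,ACT) given earlier, so those two terms are printed slightly
differently from the table.

```python
def fmt(t):
    """Render a nested-tuple term such as ("W", "S", "P") as W(S,P)."""
    if isinstance(t, str):
        return t
    return f"{t[0]}({','.join(fmt(a) for a in t[1:])})"


def inside(rule):
    """Apply a rewrite of X inside a term of the form B(H,W(S,X))."""
    def wrapped(t):
        if t[0] == "B" and isinstance(t[2], tuple) and t[2][0] == "W":
            new = rule(t[2][2])
            if new is not None:
                return ("B", t[1], ("W", t[2][1], new))
        return None
    return wrapped


def surface_request_effect(t):
    # Step 1 (PI.AE): the effect of S.REQUEST(S,H,ACT) is B(H,W(S,ACT)).
    if t[0] == "S.REQUEST":
        _, s, h, act = t
        return ("B", h, ("W", s, act))
    return None


def informif_effect(x):
    # Step 2 (PI.AE): InformIf(H,S,P) has the effect KnowIf(S,P).
    if x[0] == "InformIf":
        _, _informer, informed, p = x
        return ("KnowIf", informed, p)
    return None


def know_desire(x):
    # Step 3 (PI.KP): wanting to know whether P suggests wanting P.
    return x[2] if x[0] == "KnowIf" else None


def precondition_action(x):
    # Step 4 (PI.PA): CanDo(H,ACT) is a precondition of ACT.
    return x[2] if x[0] == "CanDo" else None


def body_action(t):
    # Step 5 (PI.BA): B(H,W(S,ACT)) is the body of REQUEST(S,H,ACT).
    if t[0] == "B" and isinstance(t[2], tuple) and t[2][0] == "W":
        return ("REQUEST", t[2][1], t[1], t[2][2])
    return None


RULES = (surface_request_effect, inside(informif_effect),
         inside(know_desire), inside(precondition_action), body_action)


def interpret(utterance):
    """Return the list of terms produced by applying the rules in order."""
    steps = [utterance]
    for rule in RULES:
        nxt = rule(steps[-1])
        if nxt is None:
            break
        steps.append(nxt)
    return steps


give = ("Give", "H", "S", "LIST")
utterance = ("S.REQUEST", "S", "H",
             ("InformIf", "H", "S", ("CanDo", "H", give)))
for n, step in enumerate(interpret(utterance)):
    print(f"({n})  {fmt(step)}")
```

The fixed rule order stands in for the control mechanism that a real
system needs; deciding which rule to try next is one of the details this
summary leaves out.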

In  giving  this  summary  of  the  plan-inference  approach  to  indirect  speech 
act  comprehension,  we  have  left  out  many  details,  including  many  necessary 
axioms,  as  well  as  mechanisms  for  deciding  which  inference  rule  to  apply. 
The  interested  reader  should  consult  Perrault  and  Allen  (1980)  and  the  other 
literature  suggested  at  the  end  of  the  chapter. 

Cue-based  interpretation  of  Dialogue  Acts 

The plan-inference approach to dialogue act comprehension is extremely
powerful; by using rich knowledge structures and powerful planning
techniques the algorithm is designed to address even subtle indirect uses
of dialogue acts. The disadvantage of the plan-inference approach is that
it is very time-consuming both in terms of human labor in development of
the plan-inference heuristics, and in terms of system time in running these
heuristics. In fact, by allowing all possible kinds of non-linguistic
reasoning to play a part in discourse processing, a complete application of
this approach is AI-complete. An AI-complete problem is one which cannot be
truly solved without solving the entire problem of creating a complete
artificial intelligence.

Thus for many applications, a less sophisticated but more efficient
data-driven method may suffice. One such method is a variant of the idiom
method discussed above. Recall that in the idiom approach, sentences like
Can you give me a list of flights from Atlanta? have two literal meanings:
one as a question and one as a request. This can be implemented in the
grammar by listing sentence structures like Can you X with two meanings.
The cue-based approach to dialogue act comprehension we develop in this
section is based on this idiom intuition.

A number of researchers have used what might be called a cue-based approach
to dialogue act interpretation, although not under that name. What
characterizes a cue-based model is the use of different sources of knowledge
(cues) for detecting a dialogue act, such as lexical, collocational,
syntactic, prosodic, or conversational-structure cues. The models we will
describe use (supervised) machine-learning algorithms, trained on a corpus
of dialogues that is hand-labeled with dialogue acts for each utterance.
Which cues are used depends on the individual system. Many systems rely on
the fact that individual dialogue acts often have what Goodwin (1996) called
a microgrammar: specific lexical, collocational, and prosodic features which
are characteristic of them. These systems also rely on conversational
structure. The dialogue-act interpretation system of Jurafsky et al. (1997),
for example, relies on three sources of information:

1. Words and Collocations: Please or would you is a good cue for a
REQUEST, are you for YES-NO-QUESTIONs.

2. Prosody: Rising pitch is a good cue for a YES-NO-QUESTION. Loudness or
stress can help distinguish the yeah that is an AGREEMENT from the yeah
that is a BACKCHANNEL.

3. Conversational Structure: A yeah which follows a proposal is probably an
AGREEMENT; a yeah which follows an INFORM is probably a BACKCHANNEL.

The previous section focused on how the plan-based approach figured out
that a surface question had the illocutionary force of a REQUEST. In this
section we’ll look at a different kind of indirect request, the CHECK,
examining the specific cues that the Jurafsky et al. (1997) system uses to
solve this dialogue act identification problem. Recall that a CHECK is a
subtype of question which requests the interlocutor to confirm some
information; the information may have been mentioned explicitly in the
preceding dialogue (as in the example below), or it may have been inferred
from what the interlocutor said:

A  OPEN-OPTION  I was wanting to make some arrangements for a trip that I’m
                going to be taking uh to LA uh beginning of the week after
                next.
B  HOLD         OK uh let me pull up your profile and I’ll be right with you
                here, [pause]
B  CHECK        And you said you wanted to travel next week?
A  ACCEPT       Uh yes.


Examples  of  possible  realizations  of  CHECKS  in  English  include: 
1.  As  tag  questions: 


(19.11) From the Trains corpus (Allen and Core, 1997)

U and  it’s  gonna  take  us  also  an  hour  to  load  boxcars  right? 

S right 

2. As declarative questions, usually with rising intonation (Quirk et al.,
1985b, p. 814)

(19.12) From the Switchboard corpus (Godfrey et al., 1992)

A and  we  have  a powerful  computer  down  at  work. 

B Oh  (laughter) 

B so,  you  don’t  need  a personal  one  (laughter)? 

A No 

3. As fragment questions (subsentential units: words, noun-phrases, clauses)
(Weber, 1993)

(19.13) From the Map Task corpus (Carletta et al., 1997)

G Ehm,  curve  round  slightly  to  your  right. 

F To  my  right? 

G Yes. 

Studies of checks have shown that, like the examples above, they are most
often realized with declarative structure (i.e., no aux-inversion), they are
most likely to have rising intonation (Shriberg et al., 1998), and they often
have a following question tag, often right (Quirk et al., 1985b, pp.
810-814), as in (19.11) above. They are also often realized as ‘fragments’
(subsentential words or phrases) with rising intonation (Weber, 1993). In
Switchboard, the REFORMULATION subtype of CHECKS has a very specific
microgrammar, with declarative word order, often you as subject (31% of the
cases), often beginning with so (20%) or oh, and sometimes ending with then.
Some examples:

Oh so you’re from the Midwest too.

So  you  can  steady  it. 

You  really  rough  it  then. 

Many scholars, beginning with Nagata and Morimoto (1994), realized that much
of the structure of these microgrammars could be captured simply by training
a separate word-N-gram grammar for each dialogue act (see e.g. Suhm and
Waibel, 1994; Mast et al., 1996; Jurafsky et al., 1997; Warnke et al., 1997;
Reithinger and Klesen, 1997; Taylor et al., 1998). These systems create a
separate mini-corpus from all the utterances which realize the same dialogue
act, and then train a separate word-N-gram language model on each of these
mini-corpora. Given an input utterance u consisting of a sequence of words
W, they then choose the dialogue act d whose N-gram grammar assigns the
highest likelihood to W:

\[ d^{*} = \operatorname*{argmax}_{d} P(d \mid W) = \operatorname*{argmax}_{d} P(d)\,P(W \mid d) \tag{19.14} \]

This simple N-gram approach does indeed capture much of the microgrammar;
for example, examination of the high-frequency bigram pairs in Switchboard
REFORMULATIONS shows that the most common bigrams include good cues for
REFORMULATIONS like so you, sounds like, so you’re, oh so, you mean, so
they, and so it’s.
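The per-act N-gram classifier of equation (19.14) can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the cited systems: it assumes bigram models with add-one smoothing, and the six-utterance training set (`TOY`) and the function names are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_utterances):
    """Build one bigram model per dialogue act from (act, words) pairs."""
    bigrams = defaultdict(Counter)   # act -> counts of (previous word, word)
    contexts = defaultdict(Counter)  # act -> counts of the previous word
    act_counts = Counter()           # act -> number of training utterances
    vocab = {"</s>"}
    for act, words in labeled_utterances:
        act_counts[act] += 1
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[act][(prev, word)] += 1
            contexts[act][prev] += 1
    return bigrams, contexts, act_counts, vocab

def log_likelihood(model, act, words):
    """log P(W | d) under the add-one-smoothed bigram model for act d."""
    bigrams, contexts, _, vocab = model
    tokens = ["<s>"] + words + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[act][(prev, word)] + 1
        denominator = contexts[act][prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total

def classify(model, words):
    """Choose d* = argmax_d P(d) P(W | d), computed in log space."""
    act_counts = model[2]
    n = sum(act_counts.values())

    def score(act):
        return math.log(act_counts[act] / n) + log_likelihood(model, act, words)

    return max(act_counts, key=score)

# Invented toy corpus; a real system would train on thousands of utterances.
TOY = [
    ("REFORMULATION", "oh so you 're from the midwest too".split()),
    ("REFORMULATION", "so you can steady it".split()),
    ("REFORMULATION", "so you mean it takes an hour".split()),
    ("STATEMENT", "we have a powerful computer down at work".split()),
    ("STATEMENT", "it takes an hour to load boxcars".split()),
    ("STATEMENT", "i need to travel in may".split()),
]
model = train(TOY)
print(classify(model, "so you need a computer".split()))
```

With this toy data the bigrams `<s> so` and `so you` dominate the score, so an unseen utterance beginning with so you is assigned to REFORMULATION.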

Prosodic models of dialogue act microgrammar rely on phonological features
like pitch or accent, or their acoustic correlates like F0, duration, and
energy, discussed in Chapter 4 and Chapter 7. For example, many studies have
shown that capturing the rise in pitch at the end of YES-NO-QUESTIONS can be
a useful cue for augmenting lexical cues (Sag and Liberman, 1975;
Pierrehumbert, 1980; Waibel, 1988; Daly and Zue, 1992; Kompe et al., 1993;
Taylor et al., 1998). Pierrehumbert (1980) also showed that declarative
utterances (like STATEMENTS) have final lowering: a drop in F0 at the end of
the utterance. One system which relied on these results, Shriberg et al.
(1998), trained CART-style decision trees on simple acoustically-based
prosodic features such as the slope of F0 at the end of the utterance, the
average energy at different places in the utterance, and various duration
measures. They found that these features were useful, for example, in
distinguishing the four dialogue acts STATEMENT (S), YES-NO-QUESTION (QY),
DECLARATIVE-QUESTIONS like CHECKS (QD), and WH-QUESTIONS (QW). Figure 19.3
shows the decision tree which gives the posterior probability P(d|F) of a
dialogue act type d given a sequence of acoustic features F. Each node in
the tree shows four probabilities, one for each of the four dialogue acts in
the order S, QY, QW, QD; the most likely of the four is shown as the label
for the node. Via Bayes’ rule, this probability can be used to compute the
likelihood of the acoustic features given the dialogue act: P(F|d).
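The Bayes' rule step can be written out explicitly. The sketch below assumes that P(d) denotes the prior probability of dialogue act d in the data used to train the tree; because P(F) does not depend on d, dividing the tree's posterior by that prior gives a quantity proportional to the likelihood, which is all that a later maximization over d requires.

```latex
\[
P(F \mid d) \;=\; \frac{P(d \mid F)\,P(F)}{P(d)} \;\propto\; \frac{P(d \mid F)}{P(d)}
\]
```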

Figure 19.3 Decision tree for the classification of STATEMENT (S), YES-NO
QUESTIONS (QY), WH-QUESTIONS (QW) and DECLARATIVE QUESTIONS (QD), after
Shriberg et al. (1998). Note that the difference between S and QY toward the
right of the tree is based on the feature norm_f0_diff (normalized difference
between mean F0 of end and penultimate regions), while the difference between
QW and QD at the bottom left is based on utt_grad, which measures F0 slope
across the whole utterance.

A final important cue for dialogue act interpretation is conversational
structure. One simple way to model conversational structure, drawing on the
idea of adjacency pairs (Schegloff, 1968; Sacks et al., 1974) introduced
above, is as a probabilistic sequence of dialogue acts. The identity of the
previous dialogue acts can then be used to help predict upcoming dialogue
acts. Many studies have modeled dialogue act sequences as dialogue-act
N-grams (Nagata and Morimoto, 1994; Suhm and Waibel, 1994; Warnke et al.,
1997; Chu-Carroll, 1998; Stolcke et al., 1998; Taylor et al., 1998), often as
part of an HMM system for dialogue acts (Reithinger et al., 1996; Kita et
al., 1996; Woszczyna and Waibel, 1994). For example, Woszczyna and Waibel
(1994) give the dialogue HMM shown in Figure 19.4 for a Verbmobil-like
appointment scheduling task.

How does the dialogue act interpreter combine these different cues to find
the most likely sequence of dialogue acts given a conversation? Stolcke et
al. (1998) and Taylor et al. (1998) apply the HMM intuition of Woszczyna and
Waibel (1994) to treat the dialogue act detection process as HMM-parsing.
Given all available evidence E about a conversation, the goal is to find the
dialogue act sequence D = {d_1, d_2, ..., d_N} that has the highest posterior
probability P(D|E) given that evidence (here we are using capital letters to
mean sequences of things). Applying Bayes’ Rule we get

\[ D^{*} = \operatorname*{argmax}_{D} P(D \mid E) = \operatorname*{argmax}_{D} \frac{P(D)\,P(E \mid D)}{P(E)} = \operatorname*{argmax}_{D} P(D)\,P(E \mid D) \tag{19.15} \]

Here P(D) represents the prior probability of a sequence of dialogue acts D.
This probability can be computed by the dialogue act N-grams introduced by
Nagata and Morimoto (1994). The likelihood P(E|D) can be computed from the
other two sources of evidence: the microsyntax models (for example the
different word-N-gram grammars for each dialogue act) and the microprosody
models (for example the decision tree for the prosodic features of each
dialogue act). The word-N-gram models for each dialogue act can be used to
estimate P(W|D), the probability of the sequence of words W. The
microprosody models can be used to estimate P(F|D), the probability of the
sequence of prosodic features F.

If we make the simplifying (but of course incorrect) assumption that the
prosody and the words are independent, we can estimate the evidence
likelihood for a sequence of dialogue acts D as follows:


\[ P(E \mid D) = P(F \mid D)\,P(W \mid D) \tag{19.16} \]

We can compute the most likely sequence of dialogue acts D* by substituting
equation (19.16) into equation (19.15), thus choosing the dialogue act
sequence which maximizes the product of the three knowledge sources
(conversational structure, prosody, and lexical/syntactic knowledge):

\[ D^{*} = \operatorname*{argmax}_{D} P(D)\,P(F \mid D)\,P(W \mid D) \]

Standard  HMM-parsing  techniques  (like  Viterbi)  can  then  be  used  to 
search  for  this  most-probable  sequence  of  dialogue  acts  given  the  sequence 
of  input  utterances. 
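The search can be sketched with a standard Viterbi decoder. This is an illustration under stated assumptions rather than the code of Stolcke et al. or Taylor et al.: P(D) is approximated by a dialogue act bigram, the per-utterance evidence score is the sum of log P(F|d) and log P(W|d), and every number in the tables below is invented. The example reuses the yeah case from the list of cues: the same ambiguous utterance is decoded as AGREEMENT after a proposal and as BACKCHANNEL after an INFORM.

```python
import math

def viterbi(acts, log_prior, log_trans, log_evidence):
    """Return the act sequence maximizing P(D) P(F|D) P(W|D).

    log_prior[d]       : log P(d) for the first utterance
    log_trans[p][d]    : log P(d | p), a dialogue act bigram
    log_evidence[t][d] : log P(F_t | d) + log P(W_t | d) for utterance t
    """
    best = [{d: log_prior[d] + log_evidence[0][d] for d in acts}]
    back = []
    for t in range(1, len(log_evidence)):
        scores, pointers = {}, {}
        for d in acts:
            prev = max(acts, key=lambda p: best[-1][p] + log_trans[p][d])
            scores[d] = best[-1][prev] + log_trans[prev][d] + log_evidence[t][d]
            pointers[d] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final act.
    path = [max(acts, key=lambda d: best[-1][d])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def logs(probs):
    return {k: math.log(v) for k, v in probs.items()}

# All numbers below are invented for illustration.
ACTS = ["INFORM", "PROPOSAL", "AGREEMENT", "BACKCHANNEL"]
PRIOR = logs({"INFORM": 0.4, "PROPOSAL": 0.4, "AGREEMENT": 0.1, "BACKCHANNEL": 0.1})
TRANS = {
    "INFORM": logs({"INFORM": 0.2, "PROPOSAL": 0.1, "AGREEMENT": 0.1, "BACKCHANNEL": 0.6}),
    "PROPOSAL": logs({"INFORM": 0.2, "PROPOSAL": 0.1, "AGREEMENT": 0.6, "BACKCHANNEL": 0.1}),
    "AGREEMENT": logs({"INFORM": 0.4, "PROPOSAL": 0.4, "AGREEMENT": 0.1, "BACKCHANNEL": 0.1}),
    "BACKCHANNEL": logs({"INFORM": 0.4, "PROPOSAL": 0.4, "AGREEMENT": 0.1, "BACKCHANNEL": 0.1}),
}
# Evidence for an ambiguous "yeah": words and prosody barely separate the two readings.
YEAH = logs({"INFORM": 0.02, "PROPOSAL": 0.02, "AGREEMENT": 0.45, "BACKCHANNEL": 0.51})
PROPOSAL_LIKE = logs({"INFORM": 0.1, "PROPOSAL": 0.8, "AGREEMENT": 0.05, "BACKCHANNEL": 0.05})
INFORM_LIKE = logs({"INFORM": 0.8, "PROPOSAL": 0.1, "AGREEMENT": 0.05, "BACKCHANNEL": 0.05})

after_proposal = viterbi(ACTS, PRIOR, TRANS, [PROPOSAL_LIKE, YEAH])
after_inform = viterbi(ACTS, PRIOR, TRANS, [INFORM_LIKE, YEAH])
print(after_proposal, after_inform)
```

Working in log space turns the product of the three knowledge sources into a sum and avoids numerical underflow on long conversations.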

The  HMM  method  is  only  one  way  of  solving  the  problem  of  data- 
driven  dialogue  act  identification.  The  link  with  HMM  tagging  suggests 
another  approach,  treating  dialogue  acts  as  tags,  and  applying  other  part- 
of-speech  tagging  methods.  Samuel  et  al.  (1998b),  for  example,  applied 
Transformation-Based  Learning  to  dialogue  act  tagging. 

Summary 

As we have been suggesting, the two ways of doing dialogue act
interpretation (via inference and via cues) each have advantages and
disadvantages. The cue-based approach may be more appropriate for systems
which require only relatively shallow dialogue structure and which can be
trained on large corpora. If a semantic interpretation is required, the
cue-based approach will still need to be augmented with a semantic
interpretation. The full inferential approach may be more appropriate when
more complex reasoning is required.

19.4  Dialogue  Structure  and  Coherence 

Section  18.2  described  an  approach  to  determining  coherence  based  on  a set 
of  coherence  relations.  In  order  to  determine  that  a coherence  relation  holds, 
the  system  must  reason  about  the  constraints  that  the  relation  imposes  on 
the  information  in  the  utterances.  We  will  call  this  view  the  informational 
approach  to  coherence.  Historically,  the  informational  approach  has  been 
applied  predominantly  to  monologues. 

The BDI approach to utterance interpretation gives rise to another view of
coherence, which we will call the intentional approach. According to this
approach, utterances are understood as actions, requiring that the hearer
infer the plan-based speaker intentions underlying them in establishing
coherence. In contrast to the informational approach, the intentional
approach has been applied predominantly to dialogue.

The intentional approach we describe here is due to Grosz and Sidner (1986),
who argue that a discourse can be represented as a composite of three
interacting components: a linguistic structure, an intentional structure,
and an attentional state. The linguistic structure contains the utterances
in the discourse, divided into a hierarchical structure of discourse
segments. (Recall the description of discourse segments in Chapter 18.) The
attentional state is a dynamically-changing model of the objects,
properties, and relations that are salient at each point in the discourse.
This aligns closely with the notion of a discourse model introduced in the
previous chapter. Centering (see Chapter 18) is considered to be a theory of
attentional state in this approach.

We will concentrate here on the third component of the approach, the
intentional structure, which is based on the BDI model of interpretation
described in the previous section. The fundamental idea is that a discourse
has associated with it an underlying purpose that is held by the person who
initiates it, called the discourse purpose (DP). Likewise, each discourse
segment within the discourse has a corresponding purpose, called a discourse
segment purpose (DSP). Each DSP has a role in achieving the DP of the
discourse in which its corresponding discourse segment appears. Listed below
are some possible DPs/DSPs that Grosz and Sidner give.




1.  Intend  that  some  agent  intend  to  perform  some  physical  task. 

2.  Intend  that  some  agent  believe  some  fact. 

3.  Intend  that  some  agent  believe  that  one  fact  supports  another. 

4.  Intend  that  some  agent  intend  to  identify  an  object  (existing  physical 
object,  imaginary  object,  plan,  event,  event  sequence). 

5.  Intend  that  some  agent  know  some  property  of  an  object. 

As opposed to the larger sets of coherence relations used in informational
accounts of coherence, Grosz and Sidner propose only two such relations:
dominance and satisfaction-precedence. DSP1 dominates DSP2 if satisfying
DSP2 is intended to provide part of the satisfaction of DSP1. DSP1
satisfaction-precedes DSP2 if DSP1 must be satisfied before DSP2.

As  an  example,  let’s  consider  the  dialogue  between  a client  (C)  and  a 
travel  agent  (A)  that  we  saw  earlier,  repeated  here  in  Figure  19.5. 

Collaboratively, the caller and agent successfully identify a flight that
suits the caller’s needs. Achieving this joint goal required that a
top-level discourse intention be satisfied, listed as I1 below, in addition
to several intermediate intentions that contributed to the satisfaction of
I1, listed as I2-I5.


I1: (Intend C (Intend A (A find a flight for C)))
I2: (Intend A (Intend C (Tell C A departure date)))
I3: (Intend A (Intend C (Tell C A destination city)))
I4: (Intend A (Intend C (Tell C A departure time)))
I5: (Intend C (Intend A (A find a nonstop flight for C)))

C1: I need to travel in May.
A1: And, what day in May did you want to travel?
C2: OK uh I need to be there for a meeting that’s from the 12th to the
    15th.
A2: And you’re flying into what city?
C3: Seattle.
A3: And what time would you like to leave Pittsburgh?
C4: Uh hmm I don’t think there’s many options for non-stop.
A4: Right. There’s three non-stops today.
C5: What are they?
A5: The first one departs PGH at 10:00am arrives Seattle at 12:05 their
    time. The second flight departs PGH at 5:55pm, arrives Seattle at
    8pm. And the last flight departs PGH at 8:15pm arrives Seattle at
    10:28pm.
C6: OK I’ll take the 5ish flight on the night before on the 11th.
A6: On the 11th? OK. Departing at 5:55pm arrives Seattle at 8pm, US
    Air flight 115.
C7: OK.

Figure 19.5 A fragment from a telephone conversation between a client (C)
and a travel agent (A) (repeated from Figure 19.1).

Intentions I2-I5 are all subordinate to intention I1, as they were all
adopted to meet preconditions for achieving intention I1. This is reflected
in the dominance relationships below.

I1 dominates I2
I1 dominates I3
I1 dominates I4
I1 dominates I5

Furthermore, intentions I2 and I3 needed to be satisfied before intention
I5, since the agent needed to know the departure date and destination city
in order to start listing nonstop flights. This is reflected in the
satisfaction-precedence relationships below.

I2 satisfaction-precedes I5
I3 satisfaction-precedes I5

The  dominance  relations  give  rise  to  the  discourse  structure  depicted 
in  Figure  19.6.  Each  discourse  segment  is  numbered  in  correspondence  with 
the  intention  number  that  serves  as  its  DP/DSP. 


DS1
  C1
  DS2: A1-C2
  DS3: A2-C3
  DS4: A3
  DS5: C4-C7


Figure  19.6  Discourse  Structure  of  the  Flight  Reservation  Dialogue 
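The two relations are simple enough to hold in ordinary data structures. The sketch below is only an illustration of the bookkeeping, not an algorithm from Grosz and Sidner: the dictionary and function names are invented, and the intentions are the five listed above.

```python
# Dominance and satisfaction-precedence for the flight dialogue.
DOMINATES = {"I1": ["I2", "I3", "I4", "I5"]}
SATISFACTION_PRECEDES = [("I2", "I5"), ("I3", "I5")]

def segment_tree(root, dominates):
    """Nest each intention under the intention that dominates it."""
    children = dominates.get(root, [])
    return {root: [segment_tree(child, dominates) for child in children]}

def respects_precedence(order, precedes):
    """Check that an order of satisfaction obeys every precedence pair."""
    position = {intention: i for i, intention in enumerate(order)}
    return all(position[a] < position[b] for a, b in precedes)

tree = segment_tree("I1", DOMINATES)
print(tree)
print(respects_precedence(["I2", "I3", "I4", "I5"], SATISFACTION_PRECEDES))
```

The nested dictionary mirrors the shape of Figure 19.6, with one discourse segment per intention, while the precedence check rejects any order in which I5 is addressed before I2 or I3.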


On what basis does this set of intentions and relationships between them
give rise to a coherent discourse? It is their role in the overall plan that
the caller is inferred to have. There are a variety of ways that plans can
be represented; here we will use the simple STRIPS model described in the
previous section. We make use of two simple action schemas; the first is the
one for booking a flight, repeated from page 731.

BOOK-FLIGHT(A,C,F):

  Constraints:   Agent(A) ∧ Flight(F) ∧ Client(C)
  Precondition:  Know(A,departure-date(F)) ∧ Know(A,departure-time(F)) ∧
                 Know(A,origin-city(F)) ∧ Know(A,destination-city(F)) ∧
                 Know(A,flight-type(F)) ∧ Has-Seats(F) ∧
                 W(C,(BOOK(A,C,F))) ∧ ...
  Effect:        Flight-Booked(A,C,F)
  Body:          Make-Reservation(A,F,C)


As can be seen, booking a flight requires that the agent know a variety of
parameters having to do with the flight, including the departure date and
time, origin and destination cities, and so forth. The utterance with which
the caller initiates the example dialogue contains the origin city and
partial information about the departure date. The agent has to request the
rest; the second action schema we use represents a simplified view of this
action (see Cohen and Perrault (1979) for a more in-depth discussion of
planning wh-questions):

REQUEST-INFO(A,C,I):

  Constraints:   Agent(A) ∧ Client(C)
  Precondition:  Know(C,I)
  Effect:        Know(A,I)
  Body:          B(C,W(A,Know(A,I)))




Because the effects of REQUEST-INFO match each precondition of BOOK-FLIGHT,
the former can be used to serve the needs of the latter. Discourse segments
DS2 and DS3 are cases in which performing REQUEST-INFO succeeds for
identifying the values of the departure date and destination city parameters
respectively. Segment DS4 is also a request for a parameter value (departure
time), but is unsuccessful in that the caller takes the initiative instead,
by (implicitly) asking about nonstop flights. Segment DS5 leads to the
satisfaction of the top-level DP from the caller’s selection of a nonstop
flight from a short list that the agent produced.
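The way the effects of REQUEST-INFO line up with the open preconditions of BOOK-FLIGHT can be sketched as follows. This is a deliberately simplified, propositional rendering and not a real STRIPS planner: facts are plain strings with no variable unification, only the Know preconditions of BOOK-FLIGHT are included, and the class and function names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Schema:
    name: str
    preconditions: frozenset
    effects: frozenset

# Only the Know(...) preconditions are modeled; Has-Seats and W(...) are omitted.
BOOK_FLIGHT = Schema(
    name="BOOK-FLIGHT",
    preconditions=frozenset({
        "Know(A,departure-date(F))", "Know(A,departure-time(F))",
        "Know(A,origin-city(F))", "Know(A,destination-city(F))",
        "Know(A,flight-type(F))",
    }),
    effects=frozenset({"Flight-Booked(A,C,F)"}),
)

def request_info(item):
    """Instantiate REQUEST-INFO(A,C,I) for one information item I."""
    return Schema(
        name=f"REQUEST-INFO({item})",
        preconditions=frozenset({f"Know(C,{item})"}),
        effects=frozenset({f"Know(A,{item})"}),
    )

def plan_requests(goal, known, candidates):
    """Map each open precondition of the goal to an action whose effect achieves it."""
    plan = {}
    for precondition in sorted(goal.preconditions - known):
        for action in candidates:
            if precondition in action.effects:
                plan[precondition] = action.name
                break
    return plan

# The text says the caller's first utterance supplies the origin city.
known = frozenset({"Know(A,origin-city(F))"})
items = ["departure-date(F)", "destination-city(F)",
         "departure-time(F)", "flight-type(F)"]
plan = plan_requests(BOOK_FLIGHT, known, [request_info(i) for i in items])
for precondition, action in plan.items():
    print(precondition, "<-", action)
```

Each entry of the resulting plan corresponds to a potential subdialogue of the kind realized by DS2 and DS3: the agent asks for one parameter so that one more precondition of the booking action holds.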

Subsidiary discourse segments like DS2 and DS3 are also called subdialogues.
The type of subdialogues that DS2 and DS3 instantiate are generally called
knowledge precondition subdialogues (Lochbaum et al., 1990; Lochbaum, 1998),
since they are initiated by the agent to help satisfy preconditions of a
higher-level goal (in this case addressing the client’s request for travel
in May). They are also called information-sharing subdialogues (Chu-Carroll
and Carberry, 1998).

Later on, in a part of the conversation not given in Figure 19.5, is another
kind of subdialogue, a correction subdialogue (Litman, 1985; Litman and
Allen, 1987). Utterances C20 through A23b constitute a correction to the
previous plan of returning on May 15:


A17:  And you said returning on May 15th?
C18:  Uh, yeah, at the end of the day.
A19:  OK. There’s #two non-stops ...#
C20:  #Act... actually#, what day of the week is the 15th?
A21:  It’s a Friday.
C22:  Uh hmm. I would consider staying there an extra day til Sunday.
A23a: OK... OK.
A23b: On Sunday I have ...


Other kinds of subdialogues that have been addressed in the literature
include subtask subdialogues (Grosz, 1974), which are used to deal with
subtasks of the overall task in a task-oriented dialogue, and correction
subdialogues (or negotiation subdialogues), which are used to deal with
conflicts or collaborative negotiation between the participants (Chu-Carroll
and Carberry, 1998).

Determining Intentional Structure

Algorithms for inferring intentional structure in dialogue (and spoken
monologue) work similarly to algorithms for inferring dialogue acts. Many
algorithms apply variants of the BDI model (e.g. Litman, 1985; Grosz and
Sidner, 1986; Litman and Allen, 1987; Carberry, 1990; Passonneau and Litman,
1993; Chu-Carroll and Carberry, 1998). Others rely on similar cues to those
described for utterance- and turn-segmentation on page 720, including cue
words and phrases (Reichman, 1985; Grosz and Sidner, 1986; Hirschberg and
Litman, 1993), prosody (Grosz and Hirschberg, 1992; Hirschberg and
Pierrehumbert, 1986; Hirschberg and Nakatani, 1996), and other cues. For
example, Pierrehumbert and Hirschberg (1990) argue that certain boundary
tones might be used to suggest a dominance relation between two intonational
phrases.

Informational versus Intentional Coherence

As we just saw, the key to intentional coherence lies in the ability of the
dialogue participants to recognize each other’s intentions and how they fit
into the plans they have. On the other hand, as we saw in the previous
chapter, informational coherence lies in the ability to establish certain
kinds of content-bearing relationships between utterances. So one might ask
what the relationship between these is: does one obviate the need for the
other, or do we need both?

Moore  and  Pollack  (1992),  among  others,  have  argued  that  in  fact  both 
levels  of  analysis  must  co-exist.  Let  us  assume  that  after  our  agent  and  caller 
have  identified  a flight,  the  agent  makes  the  statement  in  passage  (19.17). 

(19.17) You’ll want to book your reservations before the end of the day.
Proposition 143 goes into effect tomorrow.

This  passage  can  be  analyzed  either  from  the  intentional  or  informational 
perspective.  Intentionally,  the  agent  intends  to  convince  the  caller  to  book 
her  reservation  before  the  end  of  the  day.  One  way  to  accomplish  this  is  to 
provide  motivation  for  this  action,  which  is  the  role  served  by  uttering  the 
second  sentence.  Informationally,  the  two  sentences  satisfy  the  Explanation 
relation  described  in  the  last  chapter,  since  the  second  sentence  provides  a 
cause  for  the  effect  of  wanting  to  book  the  reservations  before  the  end  of  the 
day. 

Depending on the knowledge of the caller, recognition at the informational
level might lead to recognition of the speaker’s plan, or vice versa. Say,
for instance, that the caller knows that Proposition 143 imposes a new tax
on airline tickets, but did not know the intentions of the agent in uttering
the second sentence. From the knowledge that a way to motivate an action is
to provide a cause that has that action as an effect, the caller can surmise
that the agent is trying to motivate the action described in the first
sentence. Alternatively, the caller might have surmised this intention from
the discourse scenario, but have no idea what Proposition 143 is about.
Again, knowing the relationship between establishing a cause-effect
relationship and motivating something, the caller might be led to assume an
Explanation relationship, which would require that she infer that the
proposition is somehow bad for airline ticket buyers (e.g., a tax). Thus, at
least in some cases, both levels of analysis appear to be required.


19.5  Dialogue  Managers  in  Conversational  Agents 




The idea of a conversational agent is a captivating one, and conversational
agents like ELIZA, PARRY, or SHRDLU have become some of the best-known
examples of natural language technology. Modern examples of conversational
agents include airline travel information systems, speech-based restaurant
guides, and telephone interfaces to email or calendars. The dialogue manager
is the component of such conversational agents that controls the flow of the
dialogue, deciding at a high level how the agent’s side of the conversation
should proceed, what questions to ask or statements to make, and when to ask
or make them.

This section briefly summarizes some issues in dialogue manager design,
discussing some simple systems based on finite-state automata and production
rules, and some more complex ones based on more sophisticated BDI-style
reasoning and planning techniques.

The simplest dialogue managers are based on finite-state automata. For
example, imagine a trivial airline travel system whose job was to ask the
user for a departure city, a destination city, a time, and any airline
preference. Figure 19.7 shows a sample dialogue manager for such a system.
The states of the FSA correspond to questions that the dialogue manager asks
the user, and the arcs correspond to actions to take depending on what the
user responds.
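A finite-state dialogue manager of this kind can be sketched as a table of states. The code below is a minimal illustration that follows the questions of Figure 19.7; the city list, the state names, the acceptance tests, and the reprompt for the carrier question are assumptions made for the example.

```python
CITIES = {"boston", "pittsburgh", "seattle"}  # hypothetical city lexicon

def is_city(answer):
    return answer.lower() in CITIES

def is_time(answer):
    return answer.lower() in {"morning", "evening"}

def is_yes_no(answer):
    return answer.lower() in {"yes", "no"}

# state -> (question, reprompt, test on the answer, function giving the next state)
FSA = {
    "from": ("What city are you leaving from?",
             "Please say the name of a city.", is_city, lambda a: "to"),
    "to": ("Where are you going?",
           "Please say the name of a city.", is_city, lambda a: "time"),
    "time": ("When would you like to leave?",
             "Please say morning or evening.", is_time, lambda a: "ask_carrier"),
    "ask_carrier": ("Do you want to specify a carrier?",
                    "Please say 'yes' or 'no'.", is_yes_no,
                    lambda a: "carrier" if a.lower() == "yes" else None),
    # The reprompt below is invented; any non-empty answer is accepted.
    "carrier": ("Which carrier do you prefer?",
                "Please name a carrier.", lambda a: bool(a.strip()), lambda a: None),
}

def run(answers, start="from"):
    """Drive the automaton over scripted answers; return the prompts and the answers kept."""
    state, filled = start, {}
    prompts = [FSA[state][0]]
    for answer in answers:
        _, reprompt, accepts, next_state = FSA[state]
        if not accepts(answer):
            prompts.append(reprompt)   # stay in the same state and ask again
            continue
        filled[state] = answer
        state = next_state(answer)
        if state is None:              # final state reached
            break
        prompts.append(FSA[state][0])
    return prompts, filled

prompts, filled = run(["Pittsburgh", "somewhere", "Seattle", "evening", "no"])
for line in prompts:
    print(line)
```

Because the system alone decides which question comes next, an answer that volunteers extra information is simply rejected or ignored, which is the limitation of this architecture.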

[Figure 19.7: finite-state automaton. Question states, in order: “What city
are you leaving from?”, “Where are you going?”, “When would you like to
leave?”, “Do you want to specify a carrier?”, “Which carrier do you prefer?”.
The arcs is-city(answer) and is-time(answer) advance to the next question;
not-city(answer), not-time(answer), and not-yes-or-no(answer) lead to the
reprompts “Please say the name of a city”, “Please say morning or evening”,
and “Please say ‘yes’ or ‘no’”; is-yes(answer) leads to the carrier question
and is-no(answer) bypasses it.]

Figure 19.7 A simple finite-state automaton architecture for a dialogue
manager.

Systems which completely control the conversation in this way are called
single initiative or system initiative systems. While this simple dialogue
manager architecture is sufficient for some tasks (for example for
implementing a speech interface to an automatic teller machine or a simple
geography quiz), it is probably too restricted for a speech-based travel
agent system (see the discussion in McTear (1998)). One reason is that it is
convenient for users to use more complex sentences that may answer more than
one question at a time, as in the following ATIS example:

I want  a flight  from  Milwaukee  to  Orlando  one  way  leaving  after 
five  pm  on  Wednesday. 

Many speech-based question answering systems, beginning with the influential
GUS system for airline travel planning (Bobrow et al., 1977), and including
more recent ATIS systems and other travel and restaurant guides, are frame-
or template-based. For example, a simple airline system might have the goal
of helping a user find an appropriate flight. It might have a frame or
template with slots for various kinds of information the user might need to
specify. Some of the slots come with prespecified questions to ask the user:

Slot            Optional Question

From_Airport    “From what city are you leaving?”
To_Airport      “Where are you going?”
Dep_time        “When would you like to leave?”
Arr_time        “When do you want to arrive?”
Fare_class
Airline
Oneway
Such a simple dialogue manager may just ask questions of the user,
filling out the template with the answers, until it has enough information to
perform a database query, and then return the result to the user. Not every
slot may have an associated question, since the dialogue designer may not
want the user deluged with questions. Nonetheless, the system must be able
to fill these slots if the user happens to specify them.
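
The slot-filling loop just described might be sketched as follows. The slot names follow the table above (written with underscores); the `REQUIRED` tuple, which decides when there is "enough information" for a query, and the dictionary handed to `update` (standing in for the output of some semantic parser) are assumptions made purely for illustration.

```python
# Slot -> prespecified question; None means the slot is never asked about
# but can still be filled if the user volunteers the information.
QUESTIONS = {
    "From_Airport": "From what city are you leaving?",
    "To_Airport": "Where are you going?",
    "Dep_time": "When would you like to leave?",
    "Arr_time": "When do you want to arrive?",
    "Fare_class": None,
    "Airline": None,
    "Oneway": None,
}

# Assumption: these slots are enough to run a database query.
REQUIRED = ("From_Airport", "To_Airport", "Dep_time")

def update(frame, parsed):
    """Merge slot values extracted from one user utterance into the frame."""
    for slot, value in parsed.items():
        if slot not in QUESTIONS:
            raise KeyError(f"unknown slot: {slot}")
        frame[slot] = value
    return frame

def next_question(frame):
    """Return (slot, question) for the first unfilled required slot,
    or None once the frame is complete enough to query the database."""
    for slot in REQUIRED:
        if slot not in frame:
            return slot, QUESTIONS[slot]
    return None

# Hypothetical parser output for the ATIS sentence quoted earlier.
frame = update({}, {"From_Airport": "Milwaukee", "To_Airport": "Orlando",
                    "Oneway": True, "Dep_time": "after 5 pm Wednesday"})
```

Because one utterance can fill several slots at once, including the unprompted `Oneway` slot, `next_question(frame)` returns `None` here and the system can go straight to the database.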

Even such simple domains require more than this single-template
architecture. For example, there is likely to be more than one flight that
meets the user’s constraints. This means that the user will be given a list of
choices, either on a screen or, for a purely telephone interface, by listing
them verbally. A template-based system can then have another kind of
template which has slots for identifying elements of lists of flights (How much
is the first one? or Is the second one non-stop?). Other templates might have
general route information (for questions like Which airlines fly from Boston
to San Francisco?), information about airfare practices (for questions like
Do I have to stay a specific number of days to get a decent airfare?) or about
car or hotel reservations. Since users may switch from template to template,
and since they may answer a future question instead of the one the system
asked, the system must be able to disambiguate which slot of which
template a given input is supposed to fill, and then switch dialogue control to
that template. A template-based system is thus essentially a production rule
system. Different types of inputs cause different productions to fire, each of
which can flexibly fill in different templates. The production rules can then
switch control based on factors such as the user’s input and some simple
dialogue history like the last question that the system asked.
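
One way to picture such a production system is as an ordered list of rules, each of which inspects the input and the dialogue history and, if it fires, names a template and the slots to fill. The sketch below is hypothetical throughout: the template names, the regular expressions, and the rule ordering are assumptions chosen to match the example questions in the paragraph above.

```python
import re

ORDINALS = {"first": 1, "second": 2, "third": 3}

def make_state():
    return {"active": "flight", "last_question": None,
            "templates": {"flight": {}, "flight_list": {}, "route_info": {}}}

def rule_list_reference(state, text):
    # "How much is the first one?" / "Is the second one non-stop?"
    m = re.search(r"\b(first|second|third) one\b", text.lower())
    if m:
        return "flight_list", {"item": ORDINALS[m.group(1)]}
    return None

def rule_route_question(state, text):
    # "Which airlines fly from Boston to San Francisco?"
    m = re.search(r"which airlines fly from (.+?) to (.+?)\??$", text.lower())
    if m:
        return "route_info", {"origin": m.group(1), "destination": m.group(2)}
    return None

def rule_answer_last_question(state, text):
    # Dialogue history: otherwise treat the input as the answer to the
    # last question the system asked.
    if state["last_question"]:
        return "flight", {state["last_question"]: text}
    return None

RULES = [rule_list_reference, rule_route_question, rule_answer_last_question]

def interpret(state, text):
    """Fire the first matching production: fill its template's slots and
    switch dialogue control to that template."""
    for rule in RULES:
        result = rule(state, text)
        if result is not None:
            template, slots = result
            state["templates"][template].update(slots)
            state["active"] = template
            return template
    return None
```

With `last_question` set to `"Dep_time"`, the input "Is the second one non-stop?" switches control to the `flight_list` template rather than being stored as a departure time, while "around nine" falls through to the last rule and fills `Dep_time` in the `flight` template.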

The  template  or  production-rule  dialogue  manager  architecture  is  often 
used  when  the  set  of  possible  actions  the  user  could  want  to  take  is  relatively 
limited,  but  where  the  user  might  want  to  switch  around  a bit  among  these 
things. 

The limitations of both the template-based and FSA-based dialogue
managers  are  obvious.  Consider  the  client’s  utterance  C4  in  the  fragment  of 
sample  dialogue  of  Figure  19.5  on  page  742,  repeated  here: 

A3:  And  what  time  would  you  like  to  leave  Pittsburgh? 

C4:  Uh  hmm  I don’t  think  there’s  many  options  for  non-stop. 

A4:  Right.  There’s  three  non-stops  today. 

C5: What are they?

A5:  The  first  one  departs  PGH  at  10:00am  . . . 

What  the  client  is  doing  in  C4  is  taking  control  or  initiative  of  the 
dialogue. C4 is an indirect request, asking the agent to check on non-stop
flights.  It  would  not  be  appropriate  for  the  system  to  just  set  the  WANTS 
NON-STOP  field  in  a template  and  ask  the  user  again  for  the  departure  time. 
The  system  needs  to  realize  that  the  user  has  indicated  that  a non-stop  flight 
is  a priority  and  that  the  system  should  focus  on  that  next. 

Conversational  agents  also  need  to  use  the  grounding  acts  described 
on  page  721.  For  example,  when  the  user  makes  a choice  of  flights,  it’s 
important  for  the  agent  to  indicate  to  the  client  that  it  has  understood  this 
choice.  Repeated  below  is  an  example  of  such  grounding  excerpted  from 
our  sample  conversation: 

C6: OK I'll take the 5ish flight on the night before on the 11th.

A6:  On  the  11th?  OK. 

It is also important for a computational conversational agent to use
requests for repairs, since given the potential for errors in the speech
recognition or the understanding, there will often be times when the agent is
confused or does not understand the user’s request.

In order to address these and other problems, more sophisticated
dialogue managers can be built on the BDI (belief, desire, intention)
architecture described on page 730. Such systems are often integrated with
logic-based planning models, and treat a conversation as a sequence of actions
to be planned.

Let’s consider the dialogue manager of the TRAINS-93 system; the
system is described in Allen et al. (1995), the dialogue manager in Traum
and Allen (1994). The TRAINS system is a spoken-language conversational
planning agent whose task is to assist the user in managing a railway
transportation system in a microworld. For example, the user and the system
might collaborate in planning to move a boxcar of oranges from one city to
another. The TRAINS dialogue manager maintains the flow of conversation
and addresses the conversational goals (such as coming up with an operational
plan for achieving the domain goal of successfully moving oranges). To do
this, the manager must model the state of the dialogue, its own intentions,
and the user’s requests, goals, and beliefs. The manager uses a conversation
act interpreter to semantically analyze the user’s utterances, a domain
planner and executer to solve the actual transportation domain problems, and a
generator to generate sentences to the user. Figure 19.8 shows an outline of
the TRAINS-93 dialogue manager algorithm.

The  algorithm  keeps  a queue  of  conversation  acts  it  needs  to  generate. 
Acts  are  added  to  the  queue  based  on  grounding,  dialogue  obligations,  or 
Dialogue_Manager

while conversation is not finished
    if user has completed a turn
    then interpret user’s utterance
    if system has obligations
    then address obligations
    else if system has turn
        then if system has intended conversation acts
             then call generator to produce NL utterances
             else if some material is ungrounded
             then address grounding situation
             else if high-level goals are unsatisfied
             then address goals
             else release turn or attempt to end conversation
    else if no one has turn
    then take turn
    else if long pause
    then take turn


Figure  19.8  A dialogue  manager  algorithm,  slightly  modified  from  Traum 
and  Allen  (1994). 


the agent’s goals. Let’s examine each of these sources. Grounding acts were
discussed on page 720; recall that a previous utterance can be grounded by an
explicit backchannel (e.g. uh-huh, yeah, or under certain circumstances ok),
or by repeating back part of the utterance. Utterances can also be grounded
implicitly by ‘taking up’ the utterance, i.e. continuing in a way which makes
it clear that the utterance was understood, such as by answering a question.

Obligations are used in the TRAINS system to enable the system to
correctly produce the second-pair part of an adjacency pair. That is, when a
user REQUESTS something of the system (e.g. REQUEST(Give(List)), or
REQUEST(InformIf(NonStop(FLIGHT-201)))), the REQUEST sets up an
obligation for the system to address the REQUEST either by accepting it,
and then performing it (giving the list or informing whether flight 201 is
non-stop), or by rejecting it.

Finally,  the  TRAINS  dialogue  manager  must  reason  about  its  own 
goals.  For  the  travel  agent  domain,  the  dialogue  manager’s  goal  might  be 
to  find  out  the  client’s  travel  goal  and  then  create  an  appropriate  plan.  Let’s 
pretend  that  the  human  travel  agent  for  the  conversation  in  Figure  19.5  was 


Methodology  Box:  Designing  Dialogue  Systems 

How  does  a dialogue  system  developer  choose  dialogue  strategies, 
architectures,  prompts,  error  messages,  and  so  on?  The  three  design 
principles  of  Gould  and  Lewis  (1985)  can  be  summarized  as 

Key  Concept  #8.  User-Centered  Design:  Study  the  user 

and  task,  build  simulations  and  prototypes,  and  iteratively  test 
them  on  the  user  and  fix  the  problems. 

1. Early Focus on Users and Task: Understand the potential
users and the nature of the task, via interviews with users and
investigation of similar systems. Study of related human-human
dialogues can also be useful, although the language in human-machine
dialogues is usually simpler than in human-human dialogues (for
example, pronouns are rare in human-machine dialogue and are very
locally bound when they do occur; Guindon, 1988).

2. Build Prototypes: In the children’s book The Wizard of
Oz (Baum 1900), the Wizard turned out to be just a simulation
controlled by a man behind a curtain. In Wizard-of-Oz (WOZ) or
PNAMBIC (Pay No Attention to the Man Behind the Curtain)
systems, the users interact with what they think is a software system,
but is in fact a human operator (‘wizard’) behind some disguising
interface software (e.g. Gould et al., 1983; Good et al., 1984; Fraser
and Gilbert, 1991). A WOZ system can be used
to test out an architecture without implementing the complete
system; only the interface software and databases need to be in place. It
is difficult for the wizard to exactly simulate the errors, limitations,
or time constraints of a real system; results of WOZ studies are thus
somewhat idealized.

3. Iterative Design: An iterative design cycle with embedded
user testing is essential in system design (Nielsen, 1992; Cole et al.,
1994, 1997; Yankelovich et al., 1995; Landauer, 1995). For example,
Stifelman et al. (1993) and Yankelovich et al. (1995) found that users
of speech systems consistently tried to interrupt the system (barge
in), suggesting a redesign of the system to recognize overlapped
speech. Kamm (1994) and Cole et al. (1993) found that directive
prompts (‘Say yes if you accept the call, otherwise, say no’) or the
use of constrained forms (Oviatt et al., 1993) produced better results
than open-ended prompts like ‘Will you accept the call?’.
a system and explore what the state of a TRAINS-style dialogue manager
would have to be to act appropriately. Let’s start with the state of the
dialogue manager (formatted following Traum and Allen (1994)) after the first
utterances in our sample conversation (repeated here):

C1: I want to go to Pittsburgh in May.

The client/user has just finished a turn with an INFORM speech act.
The system has the discourse goal of finding out the user’s travel goal (e.g.
‘Wanting to go to Pittsburgh on May 15 and returning...’), and creating
a travel plan to accomplish that goal. The following table shows the five
parameters of the system state: the list of obligations, the list of intended
speech acts to be passed to the generator, the list of the user’s speech acts
that still need to be acknowledged, the list of discourse goals, and whether
the system or the user holds the turn:

Discourse  obligations:  NONE 

Turn  holder:  system 

Intended  speech  acts:  NONE 

Unacknowledged speech acts: INFORM-1

Discourse  goals:  get-travel-goal,  create-travel-plan 

After the utterance, the dialogue manager decides to add two
conversation acts to the queue; first, to acknowledge the user’s INFORM act (via
‘address grounding situation’), and second, to ask the next question of the
user (via ‘address goals’). This reasoning would be worked out by the
system’s STRIPS-style planner as described on page 743; given the goal
get-travel-goal, the REQUEST-INFO action schema tells the system that asking
the user something is one way of finding it out. The result of adding these
two conversation acts is

Intended speech acts: REQUEST-INFORM-1, ACKNOWLEDGE-1

These  would  be  combined  by  a very  clever  generator  into  the  single 
utterance: 

A2:  And,  what  day  in  May  did  you  want  to  travel? 

Note  that  the  grounding  function  was  achieved  both  by  beginning  with  the 
discourse  marker  and  and  by  repeating  back  the  month  name  May.  The 
request  for  information  is  achieved  via  the  wh-question. 

Let’s skip ahead to the client’s utterance C4. Recall that C4 is an
indirect request, asking the agent to check on non-stop flights.

A3:  And  what  time  would  you  like  to  leave  Pittsburgh? 
C4: Uh hmm I don’t think there’s many options for non-stop.

Let’s  assume  that  our  dialogue  act  interpreter  correctly  interprets  C4  as 
REQUEST-INFORM-3.  The  state  of  the  agent  after  client  utterance  C4  is  then: 

Discourse  obligations:  address(REQUEST-INFORM-3) 

Turn  holder:  system 

Intended  speech  acts:  NONE 

Unacknowledged  speech  acts:  REQUEST-INFORM-3 
Discourse  goals:  get-travel-goal,  create-travel-plan 

The dialogue manager will first address the discourse obligation of
responding to the user’s request by calling the planner to find out how many
non-stop flights there are. The system must now answer the question, but
must also ground the user’s utterance. For a direct request, the response is
sufficient grounding. For an indirect request, an explicit acknowledgement
is an option; since the indirect request was in the form of a negative check
question, the form of acknowledgement will be right (no would have also
been appropriate for acknowledging a negative). These two acts will then be
pulled off the queue and passed to the generator:

A4:  Right.  There’s  three  non-stops  today. 
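
The two states walked through above can be run through the branching logic of Figure 19.8 in a short sketch. The `DialogueState` class and the `step` function are illustrative assumptions rather than the TRAINS-93 implementation; `step` only reports which branch of the loop would be taken once the user's utterance has been interpreted, and the final `"wait"` result is an addition for the case the figure leaves open.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """The five parameters of the system state described in the text."""
    obligations: list = field(default_factory=list)
    intended_acts: list = field(default_factory=list)
    unacknowledged: list = field(default_factory=list)
    goals: list = field(default_factory=list)
    turn_holder: str = "system"   # "system", "user", or "none"

def step(state, long_pause=False):
    """Return the branch of the Figure 19.8 loop selected for this state."""
    if state.obligations:
        return "address obligations"
    if state.turn_holder == "system":
        if state.intended_acts:
            return "call generator"
        if state.unacknowledged:
            return "address grounding situation"
        if state.goals:
            return "address goals"
        return "release turn or attempt to end conversation"
    if state.turn_holder == "none":
        return "take turn"
    if long_pause:
        return "take turn"
    return "wait"  # assumption: nothing to do while the user holds the turn

GOALS = ["get-travel-goal", "create-travel-plan"]

# State after C1: nothing owed, but INFORM-1 is still ungrounded.
after_c1 = DialogueState(unacknowledged=["INFORM-1"], goals=list(GOALS))

# State after C4: the indirect request has created an obligation.
after_c4 = DialogueState(obligations=["address(REQUEST-INFORM-3)"],
                         unacknowledged=["REQUEST-INFORM-3"],
                         goals=list(GOALS))
```

Here `step(after_c1)` selects "address grounding situation", while `step(after_c4)` selects "address obligations", matching the order in which the text says the manager deals with C1 and C4.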

Dialogue  managers  also  will  need  to  deal  with  the  kind  of  dialogue 
structure  discussed  in  Section  19.4,  both  to  recognize  when  the  user  has 
started  a subdialogue,  and  to  know  when  to  initiate  a subdialogue  itself. 


19.6  SUMMARY 

• Dialogue is a special kind of discourse which is particularly relevant to
speech processing tasks like conversational agents and automatic
meeting summarization.

• Dialogue  differs  from  other  discourse  genres  in  exhibiting  turn-taking, 
grounding,  and  implicature. 

• An  important  component  of  dialogue  modeling  is  the  interpretation  of 
dialogue  acts.  We  introduced  plan-based  and  cue-based  algorithms 
for  this. 

• Dialogue exhibits intentional structure in addition to the
informational structure, including such relations as dominance and
satisfaction-precedence.


Methodology  Box:  Evaluating  Dialogue  Systems 


Many of the metrics that have been proposed for evaluating
dialogue systems can be grouped into the following three classes:

1. User Satisfaction: Usually measured by interviewing users
(Stifelman et al., 1993; Yankelovich et al., 1995) or having them
fill out questionnaires asking e.g. (Shriberg et al., 1992; Polifroni
et al., 1992):

• Were  answers  provided  quickly  enough? 

• Did  the  system  understand  your  requests  the  first  time? 

• Do  you  think  a person  unfamiliar  with  computers  could  use  the 
system  easily? 

2.  Task  Completion  Cost: 

• completion time in turns or seconds (Polifroni et al., 1992).

• number of queries (Polifroni et al., 1992).

• number of system non-responses (Polifroni et al., 1992) or
‘turn correction ratio’: the number of system or user turns that
were used solely to correct errors, divided by the total
number of turns (Danieli and Gerbino, 1995; Hirschman and Pao,
1993).

• inappropriateness (verbose or ambiguous) of system’s
questions, answers, and error messages (Zue et al., 1989).

3.  Task  Completion  Success: 

• percent of subtasks that were completed (Polifroni et al.,
1992).

• correctness (or partial correctness) of each question, answer,
error message (Zue et al., 1989; Polifroni et al., 1992).

• correctness of the total solution (Polifroni et al., 1992).

How should these metrics be combined and weighted? The
PARADISE algorithm (Walker et al., 1997) (PARAdigm for
Dialogue System Evaluation) applies multiple regression to this
problem. The algorithm first uses questionnaires to assign each dialogue
a user satisfaction rating. A set of cost and success factors like those
above is then treated as a set of independent factors; multiple
regression is used to train a weight (coefficient) for each factor, measuring
its importance in accounting for user satisfaction. The resulting
metric can be used to compare quite different dialogue strategies.
• Dialogue managers for conversational agents range from simple template-
or frame-based production systems to complete BDI (belief-desire-
intention) models.
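
The multiple-regression step described in the Methodology Box on evaluation can be illustrated with ordinary least squares. This sketch covers only the weight-fitting idea, not the full PARADISE procedure; the function name `fit_weights`, the choice of two factors (task success and number of turns), and the toy ratings are all invented for illustration and are not results from any study.

```python
def fit_weights(factors, satisfaction):
    """Ordinary least squares via the normal equations.

    factors: one tuple of cost/success measurements per dialogue.
    satisfaction: one user satisfaction rating per dialogue.
    Returns [intercept, w1, ..., wk].
    """
    rows = [[1.0] + list(f) for f in factors]
    k = len(rows[0])
    # Augmented matrix for (X^T X) w = X^T y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, satisfaction))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        if abs(a[col][col]) < 1e-12:
            raise ValueError("factors are collinear")
        for r in range(k):
            if r != col:
                m = a[r][col] / a[col][col]
                a[r] = [x - m * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Invented toy data: (task success, number of turns) per dialogue, with
# ratings constructed as 2 + 3 * success - 0.05 * turns.
factors = [(1.0, 10), (0.5, 20), (1.0, 20), (0.0, 30), (0.5, 10)]
ratings = [4.5, 2.5, 4.0, 0.5, 3.0]
weights = fit_weights(factors, ratings)
```

On this constructed data the fitted weights recover the generating coefficients up to floating-point error: a positive weight for task success and a small negative weight for dialogue length. With real questionnaire data the fit would be approximate, and the relative sizes of the weights are what indicate each factor's importance.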


Bibliographical  and  Historical  Notes 

Early work on speech and language processing had very little emphasis on
the study of dialogue. One of the earliest conversational systems, ELIZA,
had only a trivial production system dialogue manager; if the human user’s
previous sentence matched the regular-expression precondition of a possible
response, ELIZA simply generated that response (Weizenbaum, 1966). The
dialogue manager for the simulation of the paranoid agent PARRY (Colby
et al., 1971) was a little more complex. Like ELIZA, it was based on a
production system, but where ELIZA’s rules were based only on the words in the
user’s previous sentence, PARRY’s rules also rely on global variables
indicating its emotional state. Furthermore, PARRY’s output sometimes makes
use of script-like sequences of statements when the conversation turns to its
delusions. For example, if PARRY’s anger variable is high, he will choose
from a set of ‘hostile’ outputs. If the input mentions his delusion topic, he
will increase the value of his fear variable and then begin to express the
sequence of statements related to his delusion.

The appearance of more sophisticated dialogue managers awaited the
better understanding of human-human dialogue. Studies of the properties
of human-human dialogue began to accumulate in the 1970s and 1980s.
The Conversation Analysis community (Sacks et al., 1974; Jefferson, 1984;
Schegloff, 1982) began to study the interactional properties of conversation.
Grosz’s (1977c) dissertation significantly influenced the computational study
of dialogue with its introduction of the study of substructures in dialogues
(subdialogues), and in particular with the finding that “task-oriented
dialogues have a structure that closely parallels the structure of the task being
performed” (p. 27). The BDI model integrating earlier AI planning work
(Fikes and Nilsson, 1971) with speech act theory (Austin, 1962; Gordon and
Lakoff, 1971; Searle, 1975a) was first worked out by Cohen and Perrault
(1979), showing how speech acts could be generated, and Perrault and Allen
(1980) and Allen and Perrault (1980), applying the approach to speech-act
interpretation.

The cue-based model of dialogue act interpretation was inspired by
Hinkelman and Allen (1989), who showed how lexical and phrasal cues
could be integrated into the BDI model, and by the work on
microgrammar in the Conversation Analysis literature (e.g. Goodwin, 1996). It was
worked out at a number of mainly speech recognition labs around the world
in the late 1990s (e.g. Nagata and Morimoto, 1994; Suhm and Waibel, 1994;
Mast et al., 1996; Jurafsky et al., 1997; Warnke et al., 1997; Reithinger and
Klesen, 1997; Taylor et al., 1998).

Models of dialogue as collaborative behavior were introduced in the
late 1980s and 1990s, including the ideas of reference as a collaborative
process (Clark and Wilkes-Gibbs, 1986), models of joint intentions
(Levesque et al., 1990), and shared plans (Grosz and Sidner, 1980).
Related to this area is the study of initiative in dialogue, studying how the
dialogue control shifts between participants (Walker and Whittaker, 1990;
Smith and Gordon, 1997).


Exercises 

19.1 List the dialogue act misinterpretations in the Who’s On First routine
at the beginning of the chapter.

19.2  Write  a finite-state  automaton  for  a dialogue  manager  for  checking 
your  bank  balance  and  withdrawing  money  at  an  automated  teller  machine. 

19.3 Dispreferred responses (for example, turning down a request) are
usually signaled by surface cues, such as significant silence. Try to notice the
next time you or someone else utters a dispreferred response, and write down
the utterance. What are some other cues in the response that a system might
use to detect a dispreferred response? Consider non-verbal cues like
eye-gaze and body gestures.

19.4 When asked a question to which they aren’t sure they know the
answer, people use a number of cues in their response. Some of these cues
overlap with other dispreferred responses. Try to notice some unsure
answers to questions. What are some of the cues? If you have trouble doing
this, you may instead read Smith and Clark (1993), which lists some such
cues, and try instead to listen specifically for the use of these cues.

19.5  The  sentence  Do  you  have  the  ability  to  pass  the  salt?  is  not  generally 
interpretable  as  a question.  Why  is  this  a problem  for  the  BDI  model? 
19.6 Most universities require Wizard-of-Oz studies to be approved by a
human  subjects  board,  since  they  involve  deceiving  the  subjects.  It  is  a good 
idea  (indeed  it  is  often  required)  to  ‘debrief’  the  subjects  afterwards  and  tell 
them  the  actual  details  of  the  task.  Discuss  your  opinions  of  the  moral  issues 
involved  in  the  kind  of  deceptions  of  experimental  subjects  that  take  place  in 
Wizard-of-Oz  studies. 

19.7 Implement a small air-travel help system. Your system should get
constraints from the user about a particular flight that they want to take,
expressed in natural language, and display possible flights on a screen. Make
simplifying assumptions. You may build in a simple flight database or you
may use a flight information system on the web as your backend.

19.8 Augment your previous system to work over the phone (or
alternatively, describe the user interface changes you would have to make for it to
work over the phone). What were the major differences?

19.9 Design a simple dialog system for checking your email over the
telephone. Assume that you had a synthesizer which would read out any text
you gave it, and a speech recognizer which transcribed with perfect
accuracy. If you have a speech recognizer or synthesizer, you may actually use
them instead.

19.10  Test  your  email-reading  system  on  some  potential  users.  If  you  don’t 
have  an  actual  speech  recognizer  or  synthesizer,  simulate  them  by  acting  as 
the  recognizer/synthesizer  yourself.  Choose  some  of  the  metrics  described 
in  the  Methodology  Box  on  page  754  and  measure  the  performance  of  your 
system. 
hello, world

Kernighan  & Ritchie,  The  C Programming  Language 

...  you,  MR  KEITH  V LINDEN,  will  be  a millionaire 
January  31! 

From  a junk  mailing 


In one sense, language generation is the oldest subfield of language
processing. When computers were able to understand only the most
unnatural of command languages, they were spitting out natural texts. For
example, the oldest and most famous C program, the “hello, world” program, is
a generation program. It produces useful, literate English in context.
Unfortunately, whatever subtle or sublime communicative force this text holds is
produced not by the program itself but by the author of that program. This
approach to generation, called canned text, is easy to implement, but is
unable to adapt to new situations without the intervention of a programmer.

Language generation is also the most pervasive subfield of language
processing. Who of us has not received a form letter with our name carefully
inserted in just the right places, along with eloquent appeals for one thing or
another? This sort of program is easy to implement as well, but I doubt
if many are fooled into thinking that such a letter is hand-written English.
The inflexibility of the mechanism is readily apparent when our names are
mangled, as mine is in the junk mailing shown above, or when other obvious
mistakes are made.1 This approach, called template filling, is more flexible
than canned text and has been used in a variety of applications, but is still
limited. For example, Weizenbaum’s use of templates in ELIZA worked
well in some situations, but produced nonsense in others.2
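
The difference between canned text and template filling, and the inflexibility of both, can be seen in a few lines. The template strings below are reconstructions based on the junk mailing quoted at the start of the chapter and on the acceptance letter mentioned in footnote 1; the names `CANNED`, `MAILING`, `ADMISSION`, and `fill` are assumptions made for this sketch.

```python
# Canned text: a fixed string, identical in every context.
CANNED = "hello, world"

# Template filling: fixed text with slots filled from a data record.
MAILING = "... you, {title} {name}, will be a millionaire {date}!"
ADMISSION = "Congratulations on {student}'s admission"

def fill(template, **slots):
    """Insert the slot values into the template, with no further checks."""
    return template.format(**slots)

# The template has no notion of whether a value is a sensible name:
# it prints whatever the record contains.
letter = fill(MAILING, title="MR", name="KEITH V LINDEN", date="January 31")
notice = fill(ADMISSION, student="987-65-4321")
```

Nothing in `fill` knows what a name looks like, so a truncated surname or an identification number in the name field is reproduced faithfully in the output.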

The success of simple generation mechanisms indicates that, to a first
approximation, language generation is easier than language understanding.
A language understanding system cannot generally control the complexity
of the language structures it receives as input, while a generation system can
limit the complexity of the structure of its output. Because of this, work in
language processing initially focussed on language understanding, assuming
that any generation that needed to be done could easily be handled with
canned text or template filling mechanisms. Unfortunately, these simple
mechanisms are not flexible enough to handle applications with any
realistic variation in the information being expressed and in the context of its
expression. Even the generation used in the limited domain of the “hello,
world” program could use more flexibility. It might be more appropriate for
the program to produce:

(20.1)  Congratulations,  you’ve  just  compiled  and  run  a simple  C program 
which  means  that  your  environment  is  configured  properly. 

This text is more complex than the original and we can see a number of
potential variations. If the readers are experienced systems engineers, then we
might choose not to congratulate them on compiling a program. Doing so
might insult them. In addition, there are a number of other ways of referring
to the program in question, including “the program”, “your first C program”,
“it” and “that lame excuse for code”, each of which might or might not be
appropriate in a given situation. Note also that we didn’t need to conjoin
“compiled and run” and we could have expressed the conclusion (“your
environment is configured properly”) and then the evidence (“you’ve compiled
and run a simple C program”). These variations are not necessarily
synonymous. Language has evolved a variety of ways of expressing things, all of

1 A recent example of this arose when parents of a college-bound student received an
acceptance letter that read “Congratulations on 987-65-4321’s admission ... as a parent you will be
a partner with the university in encouraging 987-65-4321 to succeed.” (from the Associated
Press, March 26, 1999).

2 It’s  not  difficult  to  “trick”  ELIZA  into  producing  nonsense,  as  shown  in  the  following  dia- 
log with  ELIZA  4.3:  HUMAN:  “Let’s  talk  just  about  you  — not  me.  Can  you  think?”  ELIZA: 
“You  like  to  think  I — not  you  — don’t  you”  . . . HUMAN : “I  think  you  are  an  extremely  stupid 
program.”  ELIZA:  “Perhaps  you  would  like  to  be  an  extremely  stupid  program”  (Copeland, 
1993,  page  40). 
which are useful for expressing certain things in certain contexts. Therefore,
to be useful in more complex environments, a generation system must be
capable of: (1) producing an appropriate range of forms, and (2) choosing
among those forms based on the intended meaning and the context. In this
chapter, we will study the basic language generation techniques used to solve
these problems, ignoring canned text and template-based mechanisms.

20.1 Introduction to Language Generation


Language  understanding  is  somewhat  like  counting  from  one  to 
infinity;  language  generation  is  like  counting  from  infinity  to  one. 
Yorick  Wilks,  quoted  in  (Dale  et  al.,  1998a,  page  352) 

Generation  from  what?! 

attributed  to  Christopher  Longuet-Higgins 


Natural Language Generation (NLG) is the process of constructing
natural language outputs from non-linguistic inputs. The goal of this process
can be viewed as the inverse of that of natural language understanding
(NLU) in that NLG maps from meaning to text, while NLU maps from text
to meaning. In doing this mapping, generation visits many of the same
linguistic issues discussed in the previous chapters, but the inverse orientation
distinguishes its methods from those of NLU in two important ways.

First,  the  nature  of  the  input  to  the  generation  process  varies  widely 
from  one  application  to  the  next.  Although  the  linguistic  input  to  NLU  sys- 
tems may  vary  from  one  text  type  to  another,  all  text  is  governed  by  relatively 
common  grammatical  rules.  This  is  not  the  case  for  the  input  to  generation 
systems.  Each  generation  system  addresses  a different  application  with  a dif- 
ferent input  specification.  One  system  may  be  explaining  a complex  set  of 
numeric  tables  while  another  may  be  documenting  the  structure  of  an  object- 
oriented  software  engineering  model.  As  a result,  generation  systems  must 
extract  the  information  necessary  to  drive  the  generation  process. 

Second,  while  both  NLU  and  NLG  must  be  able  to  represent  a range 
of  lexical  and  grammatical  forms  required  for  the  application  domain,  their 
use  of  these  representations  is  different.  NLU  has  been  characterized  as  a 
process of hypothesis management in which the linguistic input is sequentially
scanned as the system considers alternative interpretations. Its dominant
concerns include ambiguity, under-specification, and ill-formed input.
These concerns are not generally addressed in generation research because
they don’t arise. The non-linguistic representations input to an NLG system
tend to be relatively unambiguous, well-specified, and well-formed. In
contrast,  the  dominant  concern  of  NLG  is  choice.  Generation  systems  must 
make  the  following  choices: 

• Content  selection  — The  system  must  choose  the  appropriate  content 
to  express  from  a potentially  over-specified  input,  basing  its  decision 
on  a specific  communicative  goal.  For  example,  we  noted  that  some  of 
the content included in example 20.1 might not be appropriate for all
readers.  If  the  goal  was  to  indicate  that  the  environment  is  set  up,  and 
the  reader  was  a systems  engineer,  then  we’d  probably  express  only  the 
last  clause. 

• Lexical  selection  — The  system  must  choose  the  lexical  item  most 
appropriate  for  expressing  particular  concepts.  In  example  20.1,  for 
instance,  it  must  choose  between  the  word  “configured”  and  other  po- 
tential forms  including  “set  up”. 

• Sentence  structure 

- Aggregation  — The  system  must  apportion  the  selected  content 
into  phrase,  clause,  and  sentence-sized  chunks.  Example  20.1 
combined  the  actions  of  compiling  and  running  into  a single  phrase. 

- Referring  expressions  — The  system  must  determine  how  to  re- 
fer to  the  objects  being  discussed.  As  we  saw,  the  decision  on 
how to refer to the program in example 20.1 was not trivial.

• Discourse  structure  — NLG  systems  frequently  deal  with  multi-sentence 
discourse, which must have a coherent, discernible structure. Example 20.1
included two propositions in which it was clear that one was
giving  evidence  for  the  other. 

These  issues  of  choice,  taken  together  with  the  problem  of  actually  putting 
linear sequences of words on paper, form the core of the field of NLG.
Though  it  is  a relatively  young  field,  it  has  begun  to  develop  a body  of  work 
directed  at  this  core.  This  chapter  will  introduce  this  work.  It  will  begin  by 
presenting  a simple  architecture  for  NLG  systems  and  will  then  proceed  to 
discuss the techniques commonly used in the components of that architecture.

Figure 20.1 A reference architecture for NLG systems


20.2  An  Architecture  for  Generation 

The  nature  of  the  architecture  appropriate  for  accomplishing  the  tasks  listed 
in the previous section has occasioned much debate. Practical considerations,
however, have frequently led to the architecture shown in Figure 20.1.

This  architecture  contains  two  pipelined  components: 

• Discourse Planner - This component starts with a communicative
goal  and  makes  all  the  choices  discussed  in  the  previous  section.  It 
selects  the  content  from  the  knowledge  base  and  then  structures  that 
content  appropriately.  The  resulting  discourse  plan  will  specify  all  the 
choices  made  for  the  entire  communication,  potentially  spanning  mul- 
tiple sentences  and  including  other  annotations  (including  hypertext, 
figures,  etc.). 

• Surface Realizer — This component receives the fully specified discourse
plan and generates individual sentences as constrained by its
lexical and grammatical resources. These resources define the realizer’s
potential range of output. If the plan specifies multiple-sentence
output, the surface realizer is called multiple times.

This  is  by  no  means  the  only  architecture  that  has  been  proposed  for  NLG 
systems. Other potential mechanisms include AI-style planning and blackboard
architectures. Neither is this architecture without its problems. The
simple  pipeline,  for  example,  doesn’t  allow  decisions  made  in  the  planner  to 
be  reconsidered  during  surface  realization.  Furthermore,  the  precise  bound- 
ary between  planning  and  realization  is  not  altogether  clear.  Nevertheless, 
we will use it to help organize this chapter. We’ll start by discussing the surface
realizer, the more developed of the two components, and then proceed
to  the  discourse  planner. 
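
The two-stage pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SentenceSpec`, `DiscoursePlan`, `discourse_planner`, `surface_realizer`, and the toy knowledge base are hypothetical, and the "grammar" is a single hard-wired sentence frame rather than a real realizer.

```python
from dataclasses import dataclass, field


@dataclass
class SentenceSpec:
    """One sentence-sized chunk of a discourse plan (hypothetical format)."""
    process: str
    actor: str
    goal: str
    tense: str = "present"


@dataclass
class DiscoursePlan:
    """The ordered sentence specifications chosen by the planner."""
    sentences: list = field(default_factory=list)


def discourse_planner(goal, knowledge_base):
    """Select the content relevant to the communicative goal and order it."""
    facts = knowledge_base.get(goal, [])
    return DiscoursePlan([SentenceSpec(**fact) for fact in facts])


def surface_realizer(spec):
    """Turn one fully specified sentence spec into a string (toy frame)."""
    if spec.tense == "future":
        verb = f"will {spec.process}"
    else:
        verb = f"{spec.process}s"
    return f"The {spec.actor} {verb} the {spec.goal}."


def generate(goal, knowledge_base):
    plan = discourse_planner(goal, knowledge_base)
    # The realizer is called once per planned sentence. Nothing flows back
    # to the planner, which is the limitation of the simple pipeline.
    return " ".join(surface_realizer(s) for s in plan.sentences)


# Hypothetical knowledge base keyed by communicative goal.
KB = {
    "explain-saving": [
        {"process": "save", "actor": "system", "goal": "document",
         "tense": "future"},
    ],
}

print(generate("explain-saving", KB))
```

Because `generate` only passes data forward, a decision made in `discourse_planner` cannot be revisited in `surface_realizer`, which mirrors the pipeline problem noted above.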


20.3 Surface Realization

The  surface  realization  component  produces  ordered  sequences  of  words  as 
constrained  by  the  contents  of  a lexicon  and  grammar.  It  takes  as  input 
sentence-sized  chunks  of  the  discourse  specification.  This  section  will  in- 
troduce two  of  the  most  influential  approaches  used  for  this  task:  Systemic 
Grammar  and  Functional  Unification  Grammar.  Both  of  these  approaches 
will  be  used  to  generate  the  following  example: 

(20.2)  The  system  will  save  the  document. 

There  is  no  general  consensus  as  to  the  level  at  which  the  input  to 
the  surface  realizer  should  be  specified.  Some  approaches  specify  only  the 
propositional  content,  so  in  the  case  of  example  20.2,  the  discourse  plan 
would  specify  a saving  action  done  by  a system  entity  to  a document  entity. 
Other  approaches  go  so  far  as  to  include  the  specification  of  the  grammatical 
form  (in  this  case,  a future  tense  assertion)  and  lexical  items  (in  this  case, 
“save”,  “system”,  and  “document”). 

As  we  will  see,  systems  using  the  two  approaches  discussed  in  this  sec- 
tion take  input  at  different  levels.  One  thing  they  have  in  common,  however, 
is  that  they  take  input  that  is  functionally  specified  rather  than  syntactically 
specified.  This  fact,  which  is  typical  of  generation  systems,  has  tended  to 
preclude  the  use  of  the  syntactic  formalisms  discussed  earlier  in  this  book. 
Generation  systems  start  with  meaning  and  context,  so  it  is  most  natural  to 
specify  the  intended  output  in  terms  of  function  rather  than  of  form.  Ex- 
ample 20.2,  for  instance,  could  be  stated  in  either  active  or  passive  form. 
Discourse planners tend not to work with these syntactic terms. They are
more likely to keep track of the focus or local topic of the discourse, and
thus it is more natural to specify this distinction in terms of focus. So in
the example, if the document is the local topic of the discourse, it would be
marked as the focus, which could trigger the use of the passive. As we will
see,  both  of  the  approaches  discussed  here  categorize  grammar  in  functional 
terms. 
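
To make the function-versus-form point concrete, here is a minimal sketch in which the input names a focus and the realizer, not the planner, decides on active or passive voice. The `focus` key, the `PAST_PARTICIPLE` table, and both function names are assumptions made for this illustration; the future-tense frame is hard-wired.

```python
# Hypothetical mini-lexicon of past participles.
PAST_PARTICIPLE = {"save": "saved"}


def choose_voice(spec):
    """Use the passive when the goal (the thing acted upon) is in focus."""
    return "passive" if spec.get("focus") == spec["goal"] else "active"


def realize(spec):
    """Realize a future-tense assertion in the voice implied by the focus."""
    if choose_voice(spec) == "passive":
        participle = PAST_PARTICIPLE[spec["process"]]
        return f"The {spec['goal']} will be {participle} by the {spec['actor']}."
    return f"The {spec['actor']} will {spec['process']} the {spec['goal']}."


base = {"process": "save", "actor": "system", "goal": "document"}
print(realize(base))
print(realize({**base, "focus": "document"}))
```

The planner never mentions "passive"; it only marks what the discourse is about, and the syntactic consequence is left to the grammar.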


Systemic  Grammar 

Systemic  grammar  is  part  of  Systemic-Functional  linguistics,  a branch  of  functional 
linguistics  that  views  language  as  a resource  for  expressing  meaning  in  con- 
text (Halliday,  1985b).  Systemic  grammars  represent  sentences  as  collec- 
tions of  functions  and  maintain  rules  for  mapping  these  functions  onto  ex- 
plicit grammatical  forms.  This  approach  is  well-suited  to  generation  and  has 
thus  been  widely  influential  in  NLG.  This  section  will  start  with  an  example 
of  systemic  sentence  analysis.  It  will  then  discuss  a simple  systemic  gram- 
mar and  apply  it  to  the  running  example. 

Systemic  sentence  analyses  organize  the  functions  being  expressed  in 
multiple  “layers”,  as  shown  in  this  analysis  of  example  20.2: 


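+--------------+------------+--------+------------+--------------+
|              | The system | will   | save       | the document |
+--------------+------------+--------+------------+--------------+
| Mood         | subject    | finite | predicator | object       |
| Transitivity | actor      | process             | goal         |
| Theme        | theme      | rheme                              |
+--------------+------------+--------+------------+--------------+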

Here,  the  mood  layer  indicates  a simple  declarative  structure  with  subject,  fi- 
nite (auxiliary),  predicator  (verb)  and  object.  The  transitivity  layer  indicates 
that  the  “system”  is  the  actor,  or  doer,  of  the  process  of  “saving”,  and  that  the 
goal,  or  object  acted  upon,  is  the  “document”.3  The  theme  layer  indicates 
that  the  “system”  is  the  theme,  or  focus  of  attention,  of  the  sentence.4  Notice 
that  the  three  layers  deal  with  different  sets  of  functions.  These  three  sets, 
called  meta-functions,  represent  three  fundamental  concerns  in  generation: 

• The  interpersonal  meta-function  groups  those  functions  that  estab- 
lish and  maintain  the  interaction  between  the  writer  and  the  reader.  It 
is  represented  here  by  the  mood  layer,  which  determines  whether  the 
writer  is  commanding,  telling,  or  asking. 

• The ideational meta-function is concerned with what is commonly
called the “propositional content” of the expression. Here, the transitivity
layer determines the nature of the process being expressed and
the variety of case roles that must be expressed. Note that this
meta-function covers much of what is commonly termed “semantics”.

• The textual meta-function is concerned with the way in which the
expression fits into the current discourse. This includes issues of
thematization and reference. In our example, the theme layer represents this
in that it explicitly marks “the system” as the theme of the sentence.

3 These thematic roles are discussed in Chapter 16.

4 The concepts of theme and rheme were developed by the Prague school of linguistics
(Firbas, 1966).

This explicit concern for interpersonal and textual issues as well as traditional
semantics is another feature of systemic linguistics that is attractive
for NLG. Many of the choices that generation systems must make depend on
the context of communication, which is formalized by the interpersonal and
textual meta-functions.

A systemic grammar is capable of building a sentence structure such
as the one just shown. The grammar is represented using a directed, acyclic,
and/or graph called a system network. Figure 20.2 illustrates a simple system
network. Here, the large curly brace indicates “and” (i.e., parallel) systems,
while the straight vertical lines represent “or” (i.e., disjoint) systems.
Thus, every clause (represented as the highest level feature on the far left)
will simultaneously have a set of features for mood, transitivity and theme,
but will be either indicative or imperative, not both. Although the system
network formalism doesn’t require the use of systemic theory, we will
loosely base this sample grammar on systemic categorizations. With respect
to this grammar, example 20.2 is an indicative, declarative clause expressing
an active material process with an unmarked theme.

[Diagram given in outline form: each system is indented under the feature
that leads into it, and the realization statements of a feature follow it
after a colon.]

Clause
  Mood
    Indicative: +subject; subject > predicator; +finite; finite > predicator;
                finite : auxiliary; subject : noun phrase
      Indicative Type
        Declarative: subject > finite
        Interrogative: finite > subject
          Interrogative Type
            Wh-: +question; question > finite; question : Wh-
            Polar
    Imperative: predicator : infinitive
  Transitivity
    Material Process: +goal; +process; process = finite, predicator
      Voice
        Active: +actor; actor = subject; +object; object = goal;
                predicator > object; object : noun phrase
        Passive: goal = subject; finite ! be; predicator : past-participle
    Relational Process
  Theme
    Unmarked Theme: +theme; +rheme; theme = subject;
                    rheme = predicator, object
    Marked Theme

Figure 20.2 A simple systemic grammar

A systemic grammar uses realization statements to map from the
features specified in the grammar (e.g., Indicative, Declarative) to syntactic
form. Each feature in the network can have a set of realization statements
specifying constraints on the final form of the expression. These are shown
in Figure 20.2 as a set of italicized statements below each feature. Realization
statements allow the grammar to constrain the structure of the expression
as the system network is traversed. They are specified using a simple
set of operators shown here:

+X     Insert the function X. For example, the grammar in Figure 20.2
       specifies that all clauses will have a predicator.

X = Y  Conflate the functions X and Y. This allows the grammar to build a
       layered function structure by assigning different functions to the same
       portion of the expression. For example, active clauses conflate the
       actor with the subject, while passive clauses conflate the goal with the
       subject.

X > Y  Order function X somewhere before function Y. For example, indicative
       sentences place the subject before the predicator.

X : A  Classify the function X with the lexical or grammatical feature A. These
       classifications signal a recursive pass through the grammar at a lower
       level. The grammar would include other networks similar to the clause
       network that apply to phrases, lexical items, and morphology. As an
       example, note that the indicative feature inserts a subject function that
       must be a noun phrase. This phrase will be further specified by another
       pass through the grammar.

X ! L  Assign function X the lexical item L. In Figure 20.2, the finite element
       of the passive is assigned the lexical item “be”.

Given  a fully  specified  system  network,  the  procedure  for  generation  is  to: 

1.  Traverse  the  network  from  left  to  right,  choosing  the  appropriate  fea- 
tures and  collecting  the  associated  realization  statements; 

2.  Build  an  intermediate  expression  that  reconciles  the  constraints  set  by 
the  realization  statements  collected  during  this  traversal; 

3.  Recurse  back  through  the  grammar  at  a lower  level  for  any  function 
that is not fully specified.
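
Step 1 of this procedure can be sketched as a depth-first walk over the network that records the chosen features and gathers their realization statements. The sketch below is an assumption-laden illustration, not the chapter's implementation: the dictionaries `REALIZATION`, `SYSTEMS`, and `KNOWLEDGE_BASE`, the chooser functions, and the `passive` and `theme` input keys are all hypothetical. Statements are written with `+` for insertion, `=` for conflation, `>` for ordering, and `:` for classification.

```python
# Realization statements per feature, following the running example.
REALIZATION = {
    "clause": ["+predicator", "predicator : verb"],
    "indicative": ["+subject", "+finite", "subject > predicator",
                   "finite > predicator", "subject : noun phrase",
                   "finite : auxiliary"],
    "declarative": ["subject > finite"],
    "imperative": ["predicator : infinitive"],
    "material-process": ["+goal", "+process", "process = finite, predicator"],
    "active": ["+actor", "actor = subject", "+object", "object = goal",
               "predicator > object", "object : noun phrase"],
    "unmarked-theme": ["+theme", "+rheme", "theme = subject",
                       "rheme = predicator, object"],
}

# Hypothetical knowledge base: process instance -> process type.
KNOWLEDGE_BASE = {"save-1": "material-process"}


def choose_mood(spec):
    return "imperative" if spec["speechact"] == "command" else "indicative"


def choose_indicative_type(spec):
    return "declarative" if spec["speechact"] == "assertion" else "interrogative"


def choose_transitivity(spec):
    return KNOWLEDGE_BASE[spec["process"]]


def choose_voice(spec):
    return "passive" if spec.get("passive") else "active"


def choose_theme(spec):
    return "marked-theme" if "theme" in spec else "unmarked-theme"


# Systems entered once a feature is chosen. Several entries under one
# feature are parallel ("and") systems; each chooser resolves an "or".
SYSTEMS = {
    "clause": [choose_mood, choose_transitivity, choose_theme],
    "indicative": [choose_indicative_type],
    "material-process": [choose_voice],
}


def traverse(feature, spec):
    """Return the features chosen from `feature` rightward, depth first."""
    chosen = [feature]
    for chooser in SYSTEMS.get(feature, []):
        chosen.extend(traverse(chooser(spec), spec))
    return chosen


def collect(spec):
    """Traverse the network and gather all realization statements."""
    features = traverse("clause", spec)
    statements = [s for f in features for s in REALIZATION.get(f, [])]
    return features, statements


SPEC = {"process": "save-1", "actor": "system-1", "goal": "document-1",
        "speechact": "assertion", "tense": "future"}
features, statements = collect(SPEC)
print(features)
```

Steps 2 and 3 (reconciling the collected constraints and recursing into lower-level networks) are not covered by this sketch.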

To  illustrate  this  process,  we  will  use  the  sample  grammar  to  generate  exam- 
ple 20.2  (“The  system  will  save  the  document”).  We  will  use  the  following 
specification  as  input:5 


(
  :process    save-1
  :actor      system-1
  :goal       document-1
  :speechact  assertion
  :tense      future
)


Here,  the  save-1  knowledge  base  instance  is  identified  as  the  process  of 
the  intended  expression.  We  will  assume  all  knowledge  base  objects  to 
be KLONE-styled instances (Brachman, 1979) for which proper lexical entries
exist. The actor and goal are similarly specified as system-1 and
document-1 respectively. The input also specifies that the expression be
in  the  form  of  an  assertion  in  the  future  tense. 

The generation process starts with the clause feature in Figure 20.2, inserting
a predicator and classifying it as a verb. It then proceeds to the mood
system.  The  correct  option  for  a system  is  chosen  by  a simple  query  or  de- 
cision network  associated  with  that  system.  The  query  or  decision  network 
bases  its  decision  on  the  relevant  information  from  the  input  specification 
and  from  the  knowledge  base.  In  this  case,  the  mood  system  chooses  the 
indicative  and  declarative  features  because  the  input  specifies  an  assertion. 


5 This  input  specification  is  loosely  based  on  the  spl-constructor  interface  to  the  PENMAN 
system  (Mann,  1983),  a systemic  generation  system.  The  Sentence  Planning  Language 
(SPL), a more flexible input language, is discussed in the bibliographical notes below.

The realization statements associated with the indicative and declarative features
will insert subject and finite functions, and order them as subject then
finite  then  predicator.  The  resulting  function  structure  would  be  as  follows: 


+------+---------+--------+------------+
| Mood | subject | finite | predicator |
+------+---------+--------+------------+


We  will  assume  that  the  save-1  action  is  marked  as  a material  process 
in  the  knowledge  base,  which  causes  the  transitivity  system  to  choose  the 
material  process  feature.  This  inserts  the  goal  and  process  functions,  and 
conflates  the  process  with  the  finite/predicator  pair.  Because  there  is  no  indi- 
cation in  either  the  input  or  the  knowledge  base  to  use  a passive,  the  system 
chooses  the  active  feature,  which:  (1)  inserts  the  actor  and  conflates  it  with 
the  subject,  and  (2)  inserts  the  object,  conflating  it  with  the  goal  and  ordering 
it  after  the  predicator.  This  results  in: 


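+--------------+---------+--------+------------+--------+
| Mood         | subject | finite | predicator | object |
| Transitivity | actor   | process             | goal   |
+--------------+---------+--------+------------+--------+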

Finally,  because  there  is  no  thematic  specification  in  the  input,  the  theme 
network  chooses  unmarked  theme,  which  inserts  theme  and  rheme,  conflat- 
ing theme  with  subject  and  conflating  rheme  with  the  finite/predicator/object 
group.  This  results  in  the  full  function  structure  discussed  above  (repeated 
here): 


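+--------------+---------+--------+------------+--------+
| Mood         | subject | finite | predicator | object |
| Transitivity | actor   | process             | goal   |
| Theme        | theme   | rheme                        |
+--------------+---------+--------+------------+--------+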

At  this  point,  the  generation  process  recursively  enters  the  grammar  a num- 
ber of  times  at  lower  levels  to  fully  specify  the  phrases,  lexical  items,  and 
morphology.  The  noun  phrase  network  will  use  a process  like  the  one  shown 
here  to  create  “the  system”  and  “the  document”.  Systems  in  the  auxiliary 
network  will  insert  the  lexical  item  “will”.  The  choice  of  the  lexical  items 
“system”,  “document”,  and  “save”  can  be  handled  in  a number  of  ways,  most 
typically  by  retrieving  the  lexical  item  associated  with  the  relevant  knowl- 
edge base  instances. 
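
Once the lower-level passes have supplied words for each mood function, the ordering statements determine the final word order. One way to reconcile them, offered here only as a sketch, is a topological sort. The statement list and the `LEXICALIZED` mapping below are written by hand for the running example (hypothetical names), and conflation is not modelled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def order_functions(statements):
    """Linearize inserted functions so that every 'X > Y' constraint holds."""
    sorter = TopologicalSorter()
    for statement in statements:
        if statement.startswith("+"):
            sorter.add(statement[1:].strip())
        elif " > " in statement:
            before, after = (part.strip() for part in statement.split(" > "))
            sorter.add(after, before)  # 'before' must precede 'after'
    return list(sorter.static_order())


# Insertion and ordering statements collected for the mood layer.
MOOD_STATEMENTS = [
    "+predicator", "+subject", "+finite", "+object",
    "subject > predicator", "finite > predicator",
    "subject > finite", "predicator > object",
]

# Output of the lower-level passes, assumed already computed.
LEXICALIZED = {
    "subject": "the system",
    "finite": "will",
    "predicator": "save",
    "object": "the document",
}


def realize(statements, lexicalized):
    """Order the functions, substitute their words, and punctuate."""
    words = " ".join(lexicalized[f] for f in order_functions(statements))
    return words[0].upper() + words[1:] + "."


print(realize(MOOD_STATEMENTS, LEXICALIZED))
```

The four constraints here form a single chain, so only one order satisfies them. With contradictory constraints `static_order` raises `graphlib.CycleError`, which corresponds to a grammar whose realization statements cannot be reconciled.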
Functional Unification Grammar

Functional Unification Grammar uses unification (discussed in Chapter 11)
to  manipulate  and  reason  about  feature  structures  (Kay,  1979).  With  a few 
modifications,  this  technique  can  be  applied  to  NLG.  The  basic  idea  is  to 
build  the  generation  grammar  as  a feature  structure  with  lists  of  potential  al- 
ternations, and  then  to  unify  this  grammar  with  an  input  specification  built 
using  the  same  sort  of  feature  structure.  The  unification  process  then  takes 
the  features  specified  in  the  input  and  reconciles  them  with  those  in  the  gram- 
mar, producing  a full  feature  structure  which  can  then  be  linearized  to  form 
sentence  output. 

In  this  section  we  will  illustrate  this  mechanism  by  generating  exam- 
ple 20.2  again.  We  will  use  the  simple  functional  unification  grammar  shown 
in  Figure  20.3.  This  grammar,  expressed  as  an  attribute-value  matrix  (cf. 
Chapter  11),  supports  simple  transitive  sentences  in  present  or  future  tense 
and  enforces  subject-verb  agreement  on  number.  We’ll  now  walk  through 
the  structure,  explaining  the  features. 

At  its  highest  level,  this  grammar  provides  alternatives  for  sentences 
(cat  s),  noun  phrases  (cat  np)  and  verb  phrases  (cat  vp).  This  alternation 
is  specified  with  the  alt  feature  on  the  far  left.  We  use  the  curly  braces  to 
indicate  that  any  one  of  the  three  enclosed  alternatives  may  be  followed.  This 
level  also  specifies  a pattern  that  indicates  the  order  of  the  features  specified 
at  this  level,  in  this  case,  actor,  process,  then  goal. 

At  the  sentence  level,  this  grammar  supports  actor,  process,  and  goal 
features which are prespecified as NP, VP and NP respectively. Subject-verb
agreement  on  number  is  enforced  using  the  number  feature  inside  the  process 
feature.  Here  we  see  that  the  number  of  the  process  must  unify  with  the  path 
{actor  number}.  A path  is  a list  of  features  specifying  a path  from  the  root  to 
a particular  feature.  In  this  case,  the  number  of  the  process  must  unify  with 
the  number  of  the  actor.  While  this  path  is  given  explicitly,  we  can  also  have 
relative  paths  such  as  the  number  feature  of  the  head  feature  of  the  NP.  The 
path here, {↑ ↑ number}, indicates that the number of the head of the noun
phrase must unify with the number of the feature two levels up. We’ll see how
this  is  useful  in  the  example  below. 
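
The difference between explicit and relative paths can be illustrated with nested dictionaries. In this sketch, which rests on assumptions of its own, a path is a tuple of feature names, `"^"` stands for one step up, and `resolve`, `get_path`, and `dereference` are hypothetical helpers. A real unifier would make the two features share a value rather than look one up.

```python
def get_path(root, path):
    """Follow an absolute path of feature names down from the root."""
    node = root
    for feature in path:
        node = node[feature]
    return node


def resolve(location, path):
    """Turn a possibly relative path into an absolute one.

    `location` is the absolute path of the feature that holds the pointer;
    each "^" climbs one level from there.
    """
    if path[0] != "^":
        return tuple(path)  # already absolute, e.g. ("actor", "number")
    absolute = list(location)
    for step in path:
        if step == "^":
            absolute.pop()
        else:
            absolute.append(step)
    return tuple(absolute)


def dereference(root, location):
    """Return the value at `location`, following path pointers if present."""
    value = get_path(root, location)
    while isinstance(value, tuple):
        location = resolve(location, value)
        value = get_path(root, location)
    return value


FD = {
    "actor": {
        "number": "plural",
        "head": {"lex": "system", "number": ("^", "^", "number")},
    },
    "process": {"number": ("actor", "number")},
}

print(resolve(("actor", "head", "number"), ("^", "^", "number")))
print(dereference(FD, ("process", "number")))
```

Both pointers end at the same feature, the number of the actor, which is how agreement between the noun head and the verb phrase is tied to a single value.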

The  VP  level  is  similar  in  nature  to  the  NP  level  except  that  it  has  its 
own  alternation  between  present  and  future  tense.  Given  the  tense,  which  we 
will  see  specified  in  the  input  feature  structure,  the  unification  will  select  the 
alternation  that  matches  and  then  proceed  to  unify  the  associated  features.  If 
the tense is present, for example, the head will be a single verb. If, on the other
hand, the tense is future, we will insert the modal auxiliary “will” before the
head  verb. 

This  grammar  is  similar  to  the  systemic  grammar  from  the  previous 
section in that it supports multiple levels that are entered recursively during
the  generation  process.  We  now  turn  to  the  input  feature  structure,  which 
specifies  the  details  of  the  particular  sentence  we  want  to  generate.  The 
input  structure,  called  a functional  description  (FD),  is  a feature  structure 
just  like  the  grammar.  An  FD  for  example  20.2  is  as  follows: 


[ CAT      S
  ACTOR    [ HEAD [ LEX SYSTEM ] ]
  PROCESS  [ HEAD  [ LEX SAVE ]
             TENSE FUTURE ]
  GOAL     [ HEAD [ LEX DOCUMENT ] ] ]

Here we see a sentence specification with a particular actor, the system, and
a particular  goal,  the  document.  The  process  is  the  saving  of  the  document 
by  the  system  in  the  future.  The  input  structure  specifies  the  particular  verbs 
and  nouns  to  be  used  as  well  as  the  tense.  This  differs  from  the  input  to 
the  systemic  grammar.  In  the  systemic  grammar,  the  lexical  items  were  re- 
trieved from  the  knowledge  base  entities  associated  with  the  actor  and  goal. 
The  tense,  though  not  included  in  the  example  systemic  grammar,  would 
be  determined  by  a decision  network  that  distinguishes  the  relative  points 
in  time  relevant  to  the  content  of  the  expression.  This  unification  grammar, 
therefore,  requires  that  more  decisions  be  made  by  the  discourse  planning 
component. 

To  produce  the  output,  this  input  is  unified  with  the  grammar  shown  in 
Figure  20.3.  This  requires  multiple  passes  through  the  grammar.  The  pre- 
liminary unification  unifies  the  input  FD  with  the  “S”  level  in  the  grammar 
(i.e.,  the  first  alternative  at  the  top  level).  The  result  of  this  process  is  as 
follows:

Here we see that the features specified in the input structure have been
merged  and  unified  with  the  features  at  the  top  level  of  the  grammar.  For 
example,  the  features  associated  with  “actor”  include  the  lexical  item  “sys- 
tem” from  the  input  FD  and  the  category  “np”  from  the  grammar.  Similarly, 
the  process  feature  combines  the  lexical  item  and  tense  from  the  input  FD 
with  the  category  and  number  features  from  the  grammar. 

The  generation  mechanism  now  recursively  enters  the  grammar  for 
each  of  the  sub-constituents.  It  enters  the  NP  level  twice,  once  for  the  actor 
and  again  for  the  goal,  and  it  enters  the  VP  level  once  for  the  process.  The 
FD  that  results  from  this  is  shown  in  Figure  20.4.  There  we  see  that  every 
constituent  feature  that  is  internally  complex  has  a pattern  specification,  and 
that  every  simple  constituent  feature  has  a lexical  specification.  The  system 
now  uses  the  pattern  specifications  to  linearize  the  output,  producing  “The 
system  will  save  the  document.” 
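
The whole cycle of unifying, recursing, and linearizing can be sketched with nested dictionaries. The toy `GRAMMAR` below is only in the spirit of Figure 20.3, not a transcription of it: its feature names, the fixed determiner “the”, and the helpers `unify` and `realize` are assumptions, and number agreement and morphology are left out.

```python
class UnificationFailure(Exception):
    """Raised when two atomic values conflict."""


def unify(a, b):
    """Unify two feature structures (nested dicts with atomic leaves)."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            result[key] = unify(result[key], value) if key in result else value
        return result
    if a == b:
        return a
    raise UnificationFailure(f"{a!r} does not unify with {b!r}")


# Top-level alternation: one entry per alternative, tried in order.
GRAMMAR = [
    {"cat": "s", "actor": {"cat": "np"}, "process": {"cat": "vp"},
     "goal": {"cat": "np"}, "pattern": ("actor", "process", "goal")},
    {"cat": "np", "det": {"lex": "the"}, "head": {"cat": "noun"},
     "pattern": ("det", "head")},
    {"cat": "vp", "tense": "present", "head": {"cat": "verb"},
     "pattern": ("head",)},
    {"cat": "vp", "tense": "future", "aux": {"lex": "will"},
     "head": {"cat": "verb"}, "pattern": ("aux", "head")},
]


def realize(fd):
    """Unify `fd` with the first compatible alternative and linearize it."""
    if "lex" in fd:  # simple constituent: emit its lexical item
        return [fd["lex"]]
    for alternative in GRAMMAR:
        try:
            unified = unify(alternative, fd)
        except UnificationFailure:
            continue  # e.g. wrong category or tense: try the next one
        words = []
        for name in unified["pattern"]:  # pattern fixes constituent order
            words.extend(realize(unified[name]))
        return words
    raise UnificationFailure(f"no alternative matches {fd!r}")


INPUT_FD = {
    "cat": "s",
    "actor": {"head": {"lex": "system"}},
    "process": {"head": {"lex": "save"}, "tense": "future"},
    "goal": {"head": {"lex": "document"}},
}

print(" ".join(realize(INPUT_FD)))
```

The first pass merges the input with the sentence alternative, so each constituent acquires a category; the recursive calls then reject alternatives whose category or tense conflicts and stop at constituents that carry a lexical item.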

This  particular  example  did  not  specify  that  the  actor  be  plural.  We 
could  do  this  by  adding  the  feature-value  pair  “number  plural”  to  the  actor 
structure  in  the  input  FD.  Subject-verb  agreement  would  then  be  enforced 
by the unification process. The grammar requires that the number of the heads
of the NP and the VP match the number of the actor that was specified
in the input FD. The details of this process are left as an exercise.

Summary

The  two  surface  generation  grammars  we’ve  seen  in  this  section  illustrate  the 
nature of computational grammars for generation. Both used functional
categorizations. One might wonder if it would be possible to use a single
grammar for both generation and understanding. Such grammars, called
bidirectional grammars, are currently under investigation but have not found
widespread use in NLG (cf. Chapter 21). This is largely due to the additional
semantic  and  contextual  information  required  as  input  to  the  generator. 

20.4  Discourse  Planning 

The  surface  realization  component  discussed  in  the  previous  section  takes 
a specified  input  and  generates  single  sentences.  Thus,  it  has  little  or  no 
control  over  either  the  discourse  structure  in  which  the  sentence  resides  or  the 
content of the sentence itself. These things are controlled by the discourse
planner.  This  section  will  introduce  the  two  predominant  mechanisms  for 
building  discourse  structures:  text  schemata  and  rhetorical  relations. 

The  focus  on  discourse  rather  than  just  sentences  has  been  a key  fea- 
ture of  much  work  done  in  NLG.  Many  applications  require  that  the  system 
produce  multi-sentence  or  multi-utterance  output.  This  can  be  done  by  sim- 
ply producing  a sentence  for  each  component  of  the  intended  meaning,  but 
frequently  more  care  is  required  in  selecting  and  structuring  the  meaning  in 
an  appropriate  way.  For  example,  consider  the  following  alternate  revision 
of  the  “hello,  world”  output  discussed  in  the  introduction: 

(20.3) You’ve just compiled a simple C program. You’ve just run a simple
C program. Your environment is configured properly.

These sentences are fine in isolation, but the text is more disjointed than the
one given in example 20.1 and is probably harder to understand. Although
it orders the sentences in a helpful way, it doesn’t give any indication of the
relationship between them. These are the sorts of issues that drive discourse
planning. 

This section will also discuss the closely related problem of content
selection, which, as we saw earlier, is the process of selecting propositional
content from the input knowledge base based on a communicative goal. Because
the form of this knowledge base and the nature of the communicative
goal vary widely from one application to another, it is difficult to make
general statements about the content selection process. To make things
more concrete, therefore, this section will focus on the task of generating
instructions for a simple word-processing application. We’ll assume that
the knowledge base, whatever its underlying structure, can be viewed as a
KLONE-styled knowledge base. We’ll also assume that the communicative
goal is to explain the represented procedure to a new user of the system.
The knowledge base will represent the procedure for saving a file as a simple
procedural hierarchy, as shown in Figure 20.5. The procedure specified
there requires that the user choose the save option from the file menu, select
the appropriate folder and file name, and then click on the save button.
As a side-effect, the system automatically displays and removes the save-as
dialog box in response to the appropriate user actions. This representation
gives the procedural relationships between the basic actions but it doesn’t
show any of the domain knowledge concerning the structure of the interface
(e.g., which choices are on which menus) or the particular entities that are
used in the procedure (e.g., the document, the user). We’ll assume that these
are accessible in the knowledge base as well.

Figure 20.5 A portion of the saving procedure knowledge base

Text  Schemata 

Apart  from  the  rigidly  structured  canned  texts  and  slot-filler  templates  dis- 
cussed in  the  opening  of  this  chapter,  the  simplest  way  to  build  texts  is  to 
key  the  text  structure  to  the  structure  of  the  input  knowledge  base.  For  ex- 
ample, we  might  choose  to  describe  a game  of  tic-tac-toe  or  checkers  by 
reviewing  the  moves  in  the  sequence  in  which  they  were  taken.  This  strategy 
soon  breaks  down,  however,  when  we  have  a large  amount  of  information 
that could potentially be expressed in order to achieve a variety of communicative
goals. The knowledge base that contains the fragment shown in
Figure  20.5,  for  example,  could  be  expressed  as  a sequence  of  instructions 
such  as  one  might  find  in  a tutorial  manual,  or  it  could  be  expressed  as  an 
alphabetized  set  of  program  functions  such  as  one  might  find  in  a reference 
manual. 

One  approach  to  this  problem  rests  on  the  observation  that  texts  tend 
to  follow  consistent  structural  patterns.  For  example,  written  directions  ex- 
plaining how  to  carry  out  an  activity  typically  express  the  required  actions 
in the order of their execution. Any preconditions of these actions are mentioned
before the appropriate action. Similarly, side-effects of these actions
are mentioned after the appropriate action. In some domains, patterns such
as these are rarely broken. Armed with this information, we can build a
schema representing this structure, such as the one shown in Figure 20.6.
This schema is represented as an augmented transition network (ATN) in
which each node is a state and each arc is an optional transition (see Chapter 10).
Control starts in the small black node in the upper left and proceeds
to follow arcs as appropriate until execution stops in the terminal node of the
lower left. Node S0 allows the expression of any number of preconditions.
Transitioning to S1 forces the expression of the action itself. S1 allows recursive
calls to the network to express any sub-steps. The transition to S2
requires no action, and S2 allows any number of side-effects to be expressed
before halting execution.
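
The traversal just described can be sketched as a small recursive function. This is a simplified illustration, not the network of Figure 20.6 itself: the dictionary layout and the function name `express` are assumptions, and the sketch follows every optional arc, whereas a real schema would use constraints to decide which material to express.

```python
def express(action, depth=0):
    """Walk the procedure schema for one action; return sentence requests."""
    requests = []
    # S0: any number of preconditions, each expressed before the action.
    for precondition in action.get("preconditions", []):
        requests.append(("precondition", depth, precondition))
    # S0 -> S1: the action itself must be expressed.
    requests.append(("action", depth, action["name"]))
    # S1: recursive calls to the network, one per sub-step.
    for step in action.get("substeps", []):
        requests.extend(express(step, depth + 1))
    # S1 -> S2 produces no output; S2: any number of side-effects.
    for effect in action.get("side_effects", []):
        requests.append(("side-effect", depth, effect))
    return requests


# Hypothetical input, loosely following the save procedure in the text.
SAVE = {
    "name": "save the document",
    "substeps": [
        {"name": "choose the save option",
         "side_effects": ["display the save-as dialog box"]},
        {"name": "choose the folder"},
        {"name": "type the file name"},
        {"name": "click the save button"},
    ],
    "side_effects": ["save the document"],
}

requests = express(SAVE)
for kind, depth, content in requests:
    print("  " * depth + kind + ": " + content)
```

Each tuple in the result stands for one sentence generation request that would be handed to a surface realizer.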

We can use this schema to plan the expression of the example procedure
shown in Figure 20.5. When the system is asked to describe how to
save a document, the procedure schema can be activated. We’ll assume that
the knowledge base specifies no preconditions for the action of saving a file,
so we proceed directly to state S1, forcing the expression of the main action:
“Save the document”. In state S1, we recursively call the network for each
of the four sub-steps specified in the input. This expresses the first sub-step,
“choose the save option”, along with its side-effect, “this causes the system
to display the save-as dialog box”. The first sub-step has no preconditions
or sub-steps. Each of the other sub-steps is done in the same manner and
execution finally returns to the main action execution in state S2, which expresses
the result of the whole process, “this causes the system to save the
document”, and then terminates. Depending on the details of the planning,
the final text might be as follows:



Save the document: First, choose the save option from the
file menu. This causes the system to display the Save-As dialog
box. Next, choose the destination folder and type the filename.
Finally, press the save button. This causes the system to save the
document.


Each one of these sentences can be generated using one of the surface realizers
discussed in the previous section. As we can see, the schema mechanism
is  more  flexible  than  templates  or  canned  text.  It  structures  the  output  accord- 
ing to  known  patterns  of  expression,  but,  with  appropriate  constraints,  is  able 
to  insert  optional  material  collected  from  the  knowledge  base  in  a variety  of 
orders.  In  addition,  it  is  not  required  to  express  everything  in  the  knowledge 
base;  the  side-effect  of  the  “click  save  button”  action,  for  example,  was  not 
included. 

This schema mechanism produced only a high-level discourse structure.
The problem of specifying the detailed form of each of the sentences,
commonly called microplanning, is discussed in Section 20.5.

Rhetorical Relations

Schemata are useful for discourse planning provided a discrete set of consistent
patterns of expression can be found and encoded. However, they suffer
from  two  basic  problems.  First,  they  become  impractical  when  the  text  be- 
ing generated  requires  more  structural  variety  and  richness  of  expression. 
For  example,  we  may  find  that  certain  conditions  dictate  that  we  format  our 
procedural  instructions  in  a different  manner.  Some  contexts  may  dictate  that 
we  explicitly  enumerate  the  steps  in  the  procedure,  or  that  we  express  cer- 
tain segments  of  the  text  in  a different  manner  or  in  a different  order.  While 
in  principle  these  variations  could  be  supported  either  by  adding  constraints 
and  operational  code  to  the  schema  or  by  adding  new  schemata,  the  more 
variations that are required, the more difficult the schema-based approach
becomes. 

The  second  problem  with  schema-based  mechanisms  is  that  the  dis- 
course structure  they  produce  is  a simple  sequence  of  sentence  generation 
requests.  It  includes  no  higher-level  structure  relating  the  sentences  together. 
In  some  domains,  particularly  in  interactive  ones  (cf.  Chapter  19),  the  struc- 
ture of  the  previous  discourse  is  relevant  for  future  planning.  For  example, 
if  we  have  explained  a process  in  some  detail,  we  might  not  want  to  do  it 
again.  It’s  easier  to  do  these  things  when  there  is  a record  of  the  structure  of 
previous  discourse. 

A useful  approach  here  is  to  take  a look  under  the  hood  of  the  schema  in 
order  to  discover  the  more  fundamental  rhetorical  dynamics  at  work  in  a text. 
A system  informed  by  these  dynamics  could  develop  its  own  schemata  based 
on  the  situations  it  confronts.  A number  of  theories  that  attempt  to  formalize 
these  rhetorical  dynamics  have  been  proposed,  as  discussed  in  some  detail 
in Chapter 18. One such theory, Rhetorical Structure Theory (RST), is a
descriptive  theory  of  text  organization  based  on  the  relationships  that  hold 
between  parts  of  the  text  (Mann  and  Thompson,  1987b).  As  an  example, 
consider  the  following  two  texts: 

(20.4) I love to collect classic automobiles. My favorite car is my 1899
Duryea.

(20.5) I love to collect classic automobiles. My favorite car is my 1999
Toyota.

The  first  text  makes  sense.  The  fact  that  the  writer  likes  the  1899  Duryea 
follows naturally from the fact that they like classic automobiles. The second
text, however, is problematic. The problem is not with the individual
sentences; they work perfectly well in isolation. Rather, the problem is with
their combination. The fact that the two sentences are in sequence implies
that there is some coherent relationship between them. In the case of the
first text, that relationship could be characterized as one of elaboration (cf.
Chapter 19). The relationship in the second text could be characterized as one
of contrast, and the text would thus be more appropriately expressed as:

(20.6)  I love  to  collect  classic  automobiles.  However,  my  favorite  car  is  my 
1999  Toyota. 

Here, the “however” overtly signals the contrast relation to the reader. RST
claims that an inventory of 23 rhetorical relations, including ELABORATION
and CONTRAST, is sufficient to describe the rhetorical structure of a wide variety
of texts. In practice, analysts tend to make use of a subset of the relations
that are appropriate for their domain of application.

Most RST relations designate a central segment of text (“I love to collect...”),
called the nucleus, and a more peripheral segment (“My favorite
car is...”), called the satellite. This encodes the fact that many rhetorical relations
are asymmetric. Here the second text is being interpreted in terms of
the first, and not vice-versa. As we will see below, not all rhetorical relations
are asymmetric. RST relations are defined in terms of the constraints they
place on the nucleus, on the satellite, and on the combination of the nucleus
and satellite. Here are definitions of some common RST relations:

ELABORATION  — The  satellite  presents  some  additional  detail  concerning 
the  content  of  the  nucleus.  This  detail  may  be  of  many  forms: 

• a member  of  a given  set 

• an  instance  of  a given  abstract  class 

• a part  of  a given  whole 

• a step  of  a given  process 

• an  attribute  of  a given  object 

• a specific  instance  of  a given  generalization 

CONTRAST  — The  nuclei  present  things  that,  while  similar  in  some  re- 
spects, are  different  in  some  relevant  way.  This  relation  is  multi-nuclear  in 
that  it  doesn’t  distinguish  between  a nucleus  and  a satellite. 

CONDITION  — The  satellite  presents  something  that  must  occur  before  the 
situation  presented  in  the  nucleus  can  occur. 

PURPOSE  — The  satellite  presents  the  goal  of  performing  the  activity  pre- 
sented in  the  nucleus. 

SEQUENCE — This relation is multi-nuclear. The set of nuclei are realized
in succession.

RESULT  — The  situation  presented  in  the  nucleus  results  from  the  one  pre- 
sented in  the  satellite. 

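RST relations are typically graphed as follows:

[Diagram: an ELABORATION relation linking the nucleus “I love to collect
classic automobiles.” to the satellite “My favorite car is my 1899 Duryea.”]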

Here we see a graphical representation of the rhetorical relation from example
20.4. The segments of text are ordered sequentially along the bottom of
the diagram with the rhetorical relations built above them. The individual
text segments are usually clauses.

Rhetorical structure analyses are built up hierarchically, so we may
use  one  pair  of  related  texts  as  a satellite  or  nucleus  in  another  higher-level 
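relation. Consider the following three-sentence structure:

[Diagram: two clauses joined by an ELABORATION relation, with the pair
joined to a third clause by a multi-nuclear CONTRAST relation.]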


Here we see that the first two clauses are related to one another via an elaboration
relationship, and are related, as a pair, to the third clause via a contrast
relationship.  Note  also  how  the  multi-nuclear  contrast  relation  is  depicted. 
Recursive  structuring  such  as  this  allows  RST  to  build  a single  analysis  tree 
for  extended  texts. 
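
One way to picture such a recursive analysis is as a small tree data structure. The sketch below is only illustrative: the class names are assumptions, the nuclei-before-satellites ordering is a simplification, and the third clause is a made-up sentence standing in for the one in the diagram.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Segment:
    """A leaf of the analysis: usually a single clause."""
    text: str


@dataclass
class Relation:
    """A rhetorical relation; multi-nuclear relations have no satellites."""
    name: str
    nuclei: List["Node"]
    satellites: List["Node"] = field(default_factory=list)


Node = Union[Segment, Relation]


def leaves(node):
    """Return the text segments under a node, nuclei before satellites."""
    if isinstance(node, Segment):
        return [node.text]
    texts = []
    for child in node.nuclei + node.satellites:
        texts.extend(leaves(child))
    return texts


elaboration = Relation(
    "ELABORATION",
    nuclei=[Segment("I love to collect classic automobiles.")],
    satellites=[Segment("My favorite car is my 1899 Duryea.")],
)
# The third clause is hypothetical, used only to show the nesting.
contrast = Relation(
    "CONTRAST",
    nuclei=[elaboration, Segment("However, I drive my 1999 Toyota to work.")],
)

print(leaves(contrast))
```

Because a `Relation` can appear wherever a `Segment` can, a whole text can be covered by a single tree.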

Although  RST  was  originally  proposed  as  a descriptive  tool,  it  can  also 
be  used  as  a constructive  tool  for  NLG.  In  order  to  do  this,  the  rhetorical 
relations are typically recast as operators for an AI-style planner. As an
example  of  this,  we  will  look  at  a general-purpose,  top-down,  hierarchical 
planner  that  can  be  used  for  rhetorically-based  text  planning.6 

The  basic  approach  with  this  sort  of  planner  is  for  the  generation  sys- 
tem to  post  a high  level  communicative  goal  stated  in  terms  of  the  effect  that 
the  text  should  have  on  the  reader.  For  our  instructional  text  example,  we  will 
request  that  the  planner  build  a structure  to  achieve  the  goal  of  making  the 
reader  competent  to  save  a file.  The  highest  level  plan  operator  that  achieves 
this  goal  will  insert  a rhetorical  node  appropriate  for  the  goal  and  insert  sub- 
goals for  the  nucleus  and  satellite  of  that  rhetorical  relation.  These  sub-goals 
will  then  be  recursively  expanded  until  the  planning  process  reaches  the  bot- 
tom of  the  rhetorical  structure  tree,  inserting  a node  that  can  be  expressed  as 
a simple  clause. 

For  our  example,  we  would  post  the  goal: 

(COMPETENT  hearer  (DO-ACTION  <some-action>)) 

Here, the communicative goal is to make the hearer competent to do some
action.  The  action  would  be  represented  as  an  instance  in  the  knowledge 
base,  in  this  case,  as  the  root  node  from  the  procedural  hierarchy  shown  in 
Figure  20.5.  A text  plan  operator  that  would  fire  for  this  goal  would  be  as 
follows: 

Name: Expand Purpose
Effect:
    (COMPETENT hearer (DO-ACTION ?action))
Constraints:
    (AND
        (c-get-all-substeps ?action ?sub-actions)
        (NOT (singular-list? ?sub-actions)))
Nucleus:
    (COMPETENT hearer (DO-SEQUENCE ?sub-actions))
Satellites:
    (((RST-PURPOSE (INFORM s hearer (DO ?action)))
      *required*))


The basic idea of this plan operator is to explain how to do a particular action
(“?action”) by explaining how to do its substeps (“?sub-actions”). Note that the
effect field matches the goal we posted earlier. An operator is applicable
when its constraints hold. In this case, the main action (“?action”) must have
more than one sub-action. Because this is true in the current example (see
Figure 20.5), the operator inserts a rhetorical purpose node into the discourse
structure along with the goal specifications for its satellite and nucleus. The
satellite informs the hearer of the purpose of performing the main action,
and the nucleus lists the sub-actions required to achieve this goal. Note that
the effect, constraints, nucleus and satellite fields of the operator make use
of variables (identifiers starting with “?”) that are unified when the operator
is applied. Thus, the goal action is bound to “?action” and can be accessed
throughout the rest of the plan operator.

6 This text planner is adapted from the work of Moore and Paris (1993).

One  other  thing  to  notice  about  the  plan  operator  is  the  way  in  which 
content  selection  is  done.  The  constraint  field  specifies  that  there  must  be 
substeps  and  that  there  must  be  more  than  one  of  them.  Determining  whether 
the  first  constraint  holds  requires  that  the  system  retrieve  the  sub-steps  from 
the knowledge base. These sub-steps are then used as the content of the
nucleus  node  that  is  constructed.  Thus,  the  plan  operators  themselves  do  the 
content  selection  as  required  by  the  discourse  planning  process. 

The  full  text  structure  produced  by  the  planner  is  shown  in  Figure  20.7. 
The root node of this tree (i.e., the horizontal line at the very top) is the node
produced  by  the  previous  plan  operator.  The  first  nucleus  node  in  Figure  20.7 
is  the  multi-nuclear  node  comprising  all  the  sub-actions.  The  plan  operator 
that  produces  this  node  is  as  follows: 


Name: Expand Sub-Actions
Effect:
    (COMPETENT hearer (DO-SEQUENCE ?actions))
Constraints:
    NIL
Nucleus:
    (foreach ?actions (RST-SEQUENCE
        (COMPETENT hearer (DO-ACTION ?actions))))
Satellites:
    NIL


This  operator  achieves  the  nucleus  goal  posted  by  the  previous  operator.  It 
posts  a rhetorical  node  with  multiple  nuclei,  one  for  each  sub-action  required 
to  achieve  the  main  goal.  With  an  appropriate  set  of  plan  operators,  this 
planning  system  can  produce  the  discourse  structure  shown  in  Figure  20.7, 
which  could  then  be  linearized  into  the  following  text: 

To save a new file

1. Choose the save option from the file menu.
   The system will display the save-file dialog box.
2. Choose the folder.
3. Type the file name.
4. Click the save button.
   The system will save the document.

Figure 20.7 The full rhetorical structure for the example text

All  of  these  sentences  can  be  generated  by  a surface  realizer.  The  last 
one,  in  particular,  was  identified  as  example  20.2  in  the  previous  sections.  As 
mentioned  in  the  section  on  schema-based  discourse  planning,  the  problem 
of  microplanning  has  been  deferred  to  Section  20.5. 
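
The top-down expansion performed by the two plan operators can be approximated in a few lines of Python. This is a minimal sketch under several assumptions: goals are plain tuples, the knowledge base is a hypothetical dictionary of sub-steps, unification is replaced by ordinary function arguments, and the RESULT relations that express system responses are left out.

```python
# Hypothetical knowledge base: each action maps to its list of sub-steps.
SUBSTEPS = {
    "save a new file": [
        "choose the save option from the file menu",
        "choose the folder",
        "type the file name",
        "click the save button",
    ],
}


def plan_text(goal):
    """Expand a communicative goal into a rhetorical structure tree."""
    kind, argument = goal
    if kind == "DO-ACTION":
        sub_actions = SUBSTEPS.get(argument, [])
        if len(sub_actions) > 1:
            # "Expand Purpose": applicable only with more than one sub-action.
            return {"relation": "PURPOSE",
                    "nucleus": plan_text(("DO-SEQUENCE", sub_actions)),
                    "satellite": {"inform": argument}}
        # Bottom of the tree: a node expressible as a simple clause.
        return {"clause": argument}
    if kind == "DO-SEQUENCE":
        # "Expand Sub-Actions": one nucleus per sub-action.
        return {"relation": "SEQUENCE",
                "nuclei": [plan_text(("DO-ACTION", a)) for a in argument]}
    raise ValueError("no operator matches goal: " + repr(goal))


def linearize(node):
    """Turn the tree into lines of text, numbering the sequence nuclei."""
    if "clause" in node:
        return [node["clause"].capitalize() + "."]
    if node["relation"] == "PURPOSE":
        return ["To " + node["satellite"]["inform"]] + linearize(node["nucleus"])
    lines = []
    for number, nucleus in enumerate(node["nuclei"], start=1):
        first, *rest = linearize(nucleus)
        lines.append(f"{number}. {first}")
        lines.extend(rest)
    return lines


tree = plan_text(("DO-ACTION", "save a new file"))
print("\n".join(linearize(tree)))
```

Content selection happens inside `plan_text`, where the sub-steps are retrieved while checking the operator's constraint, mirroring the observation that the plan operators themselves select content.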

Summary 

In  this  section,  we  have  seen  how  schema-based  mechanisms  can  take  ad- 
vantage of  consistent  patterns  of  discourse  structure.  Although  this  approach 
has proven effective in many contexts, it is not flexible enough to handle
more varied generation tasks. Discourse planning based on rhetorical relations
was introduced to add the flexibility required to handle these sorts of
tasks.


20.5  Other  Issues 


This  section  introduces  issues  that  were  not  discussed  in  detail  in  the  previ- 
ous sections. 


Microplanning 


The  previous  sections  did  not  detail  the  process  of  mapping  from  the  dis- 
course plans  described  in  the  examples  to  the  inputs  to  the  surface  realizers. 
The  discourse  structures,  such  as  the  one  shown  in  Figure  20.7,  specified 
the  high-level  or  macro  structure  of  the  text,  but  few  of  the  details  expected 
as  input  to  the  surface  realizers.  The  problem  of  doing  this  more  detailed 
planning  is  called  microplanning. 

In  most  generation  applications,  microplanning  is  simply  hard-wired. 
For  example,  in  instruction  generation  systems,  objects  can  be  referred  to  in 
the  same  way  in  all  cases,  and  user  actions  can  be  expressed  as  separate  im- 
perative sentences.  This  greatly  simplifies  the  problem,  but  tends  to  produce 
monotonous  texts  such  as  the  one  shown  in  example  20.3.  This  illustrates 
two  of  the  primary  areas  of  concern  in  microplanning:  referring  expres- 
sions and  aggregation. 

Planning  a referring  expression  requires  that  we  determine  those  as- 
pects of  an  entity  that  should  be  used  when  referring  to  that  entity  in  a par- 
ticular context.  If  the  object  is  the  focus  of  discussion  and  has  just  been 
mentioned,  we  might  be  able  to  use  a simple  “it”,  whereas  introducing  a new 
entity  may  require  more  elaborate  expressions  like  “a  new  document  to  hold 
your term paper”. These issues are discussed in some detail in Chapter 18.

Aggregation  is  the  problem  of  apportioning  the  content  from  the  knowl- 
edge base  into  phrase,  clause,  and  sentence-sized  chunks.  We  saw  an  exam- 
ple of  this  in  the  introduction  where  two  of  the  actions  mentioned  in  exam- 
ple 20.1  were  conjoined  within  the  first  clause  as  “you’ve  just  compiled  and 
run  a simple  C program”.  This  is  more  readable  than  the  non-aggregated 
version  given  in  example  20.3  (“You’ve  just  compiled  a simple  C program. 
You’ve  just  run  a simple  C program”). 
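
A very simple form of aggregation can be sketched as follows. The clause representation (subject, verb, object triples) and the function name are assumptions made for illustration; the sketch only conjoins the verbs of adjacent clauses that share a subject and an object, and makes no attempt at comma-separated lists or other forms of aggregation.

```python
def aggregate(clauses):
    """Merge adjacent (subject, verb, object) clauses that differ only in verb."""
    merged = []
    for subject, verb, obj in clauses:
        if merged and merged[-1][0] == subject and merged[-1][2] == obj:
            # Same subject and object as the previous clause: conjoin the verbs.
            merged[-1][1].append(verb)
        else:
            merged.append([subject, [verb], obj])
    return [f"{s} {' and '.join(verbs)} {o}." for s, verbs, o in merged]


clauses = [
    ("You've just", "compiled", "a simple C program"),
    ("You've just", "run", "a simple C program"),
]
print(aggregate(clauses))
```

Clauses that do not share both subject and object are left as separate sentences.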

Microplanning  is  frequently  seen  as  an  intermediate  pipelined  mod- 
ule placed  between  the  discourse  planner  and  the  surface  realizer  (see  Fig- 
ure 20.1)  (Reiter  and  Dale,  2000).  Indeed,  more  recent  work  has  emphasized 
microplanning to the point that it is viewed as a task of importance equal to
that  of  discourse  planning  and  surface  realization.  It  is  also  possible  to  add 
planning  operators  to  the  RST-based  planning  mechanism  described  in  the 
chapter  in  order  to  perform  microplanning  tasks.  However  the  microplan- 
ning is  done,  it  serves  to  map  from  the  output  of  the  discourse  planner  to  the 
input  of  the  surface  realizer. 

Lexical  Selection 

Lexical  selection  refers  to  the  general  problem  of  choosing  the  appropriate 
words  with  which  to  express  the  chosen  content.  The  surface  realizers  dis- 
cussed in  this  chapter  explicitly  inserted  closed-class  lexical  items  as  they 
were  required,  but  deferred  the  choice  of  the  content  words  to  the  discourse 
planner.  Many  planners  simplify  this  issue  by  associating  a single  lexical 
item  with  each  entity  in  the  knowledge  base. 

Handling  lexical  selection  in  a principled  way  requires  that  the  gener- 
ation system  deal  with  two  issues.  First,  it  must  be  able  to  choose  the  appro- 
priate lexical  item  when  more  than  one  alternative  exists.  In  the  document- 
saving text  from  the  previous  section,  for  instance,  the  system  generated 
“Click the save button”. There are alternatives to the lexical item “click”,
including “hit” and “press mouse left on”. The choice between these alternatives
could consider: (1) style (here “hit” is perhaps more informal
than “click”), (2) collocation (here “click” probably co-occurs with
buttons more often in this domain), and (3) user knowledge (here a
novice computer user might need the more fully specified “press mouse left
on”).
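
The three considerations can be combined in a toy selection function. Everything in this sketch is an assumption made for illustration: the feature names, the rule that novices get the fully specified form, and especially the collocation counts, which are invented rather than drawn from any corpus.

```python
# Hypothetical lexical entries for one concept, with invented features.
CANDIDATES = {
    "click": {"style": "neutral", "explicit": False},
    "hit": {"style": "informal", "explicit": False},
    "press mouse left on": {"style": "neutral", "explicit": True},
}

# Invented co-occurrence counts standing in for corpus statistics.
COLLOCATIONS = {("click", "button"): 40, ("hit", "button"): 5}


def choose_verb(noun, novice=False, style=None):
    """Pick a verb using user knowledge, then style, then collocation."""
    # User knowledge: novices get the fully specified alternative.
    pool = [v for v, f in CANDIDATES.items() if f["explicit"] == novice]
    # Style: narrow the pool if a matching alternative exists.
    if style is not None:
        styled = [v for v in pool if CANDIDATES[v]["style"] == style]
        pool = styled or pool
    # Collocation: prefer the verb seen most often with this noun.
    return max(pool, key=lambda v: COLLOCATIONS.get((v, noun), 0))


print(choose_verb("button"))
print(choose_verb("button", style="informal"))
print(choose_verb("button", novice=True))
```

The fixed ordering of the three factors is a design shortcut; a principled system would need to weigh them against one another.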

Second,  the  generation  system  must  be  able  to  choose  the  appropriate 
grammatical  form  for  the  expression  of  the  concept.  For  example,  the  system 
could  title  the  section  “Saving  a new  file”  rather  than  “To  save  a new  file”. 
This  choice  between  the  participle  and  the  infinitive  form  is  frequently  made 
based  on  the  forms  most  commonly  employed  in  a corpus  of  instructions. 

Evaluating  Generation  Systems 

In  early  work  on  NLG,  the  quality  of  the  output  of  the  system  was  assessed 
by  the  system  builders  themselves.  If  the  output  sounded  good,  then  the  sys- 
tem was  judged  a success.  Because  this  is  not  a very  effective  test  of  system 
quality,  much  recent  interest  has  been  focussed  on  the  rigorous  evaluation  of 
NLG  systems.  Several  techniques  have  emerged. 

One technique is to statistically compare the output of the generator
with the characteristics of a corpus of target text. If the form chosen by
the generator matches the form most commonly used in the corpus, it is
judged as correct. The danger with this approach is that the corpus is usually
produced by writers that may make errors, thus skewing the corpus statistics.
The assumption is that, as Tolstoy put it (Tolstoy, 1977), “All happy families
are alike, but an unhappy family is unhappy after its own fashion.” In other
words, good text displays a consistent set of characteristics that arise again
and again, while bad text displays idiosyncratic characteristics that will not
accumulate statistically.
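
The corpus-comparison technique can be sketched in a few lines. The form labels and the tiny data sets below are invented for illustration; a real evaluation would extract the forms from an actual corpus of target text and from the generator's output.

```python
from collections import Counter


def majority_form(corpus_forms):
    """Return the form used most often in the corpus."""
    return Counter(corpus_forms).most_common(1)[0][0]


def agreement(generated_forms, corpus_forms):
    """Fraction of generator choices that match the corpus majority form."""
    target = majority_form(corpus_forms)
    hits = sum(1 for form in generated_forms if form == target)
    return hits / len(generated_forms)


# Invented data: grammatical forms chosen for section titles.
corpus = ["gerund", "gerund", "infinitive", "gerund"]
generated = ["gerund", "infinitive"]
print(agreement(generated, corpus))
```

Occasional writer errors in the corpus change the counts only slightly, which is the sense in which idiosyncratic characteristics fail to accumulate.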

Another  technique  is  to  convene  a panel  of  experts  to  judge  the  output 
of  the  generator  in  comparison  with  text  produced  by  human  authors.  In  this 
variation  of  the  Turing  test,  the  judges  do  not  know  which  texts  were  gen- 
erated by  the  system  and  which  were  written  by  human  authors.  Computer 
generated  text  typically  scores  lower  than  human  written  text,  but  its  quality 
approaches  that  of  human  authors  in  some  restricted  domains. 

A final  technique  is  to  judge  how  effective  the  generated  text  is  at 
achieving  its  goal.  For  example,  if  the  text  is  intended  to  describe  some 
object,  its  quality  can  be  measured  in  terms  of  how  well  readers  score  on 
a content  quiz  given  after  reading  the  output  text.  If  the  text  is  intended  to 
explain  how  to  perform  some  process,  its  quality  can  be  measured  in  terms 
of  the  number  of  procedural  errors  made  by  the  reader  after  reading  the  text. 

Generating  Speech 

This chapter has focussed on generating text rather than on generating speech.
There are, however, many situations in which speech output is preferable if
not absolutely necessary. These include situations where there is no textual
display, such as when the user is using a telephone, and situations where the
users are unable to look at a textual display, such as when the user is driving
or when the user is disabled.

A simplistic  approach  might  be  to  pass  the  word  string  that  is  produced 
by  a generation  system  to  a text-to-speech  synthesizer  of  the  sort  described 
in  Chapter  4,  Chapter  5,  and  Chapter  7.  One  problem  with  this  approach  was 
already  discussed  on  page  120  and  page  601:  text-to-speech  systems  must 
then deal with homographs (i.e., words with the same spelling but different
pronunciations).  Consider  the  following  example: 

(20.7)  Articulate  people  can  clearly  articulate  the  issues. 

Here, the two instances of the spelling “articulate” must be pronounced differently.
Another problem is the treatment of prosody, which requires that
appropriate  pitch  contours  and  stress  patterns  be  assigned  to  the  speech  being 
produced. 

The  simplistic  approach  requires  the  text-to-speech  system  to  solve 
both  of  these  problems  by  analyzing  the  input  text.  Homographs  can  fre- 
quently be  distinguished  using  part-of-speech  tagging  (the  adjective  and  verb 
forms of “articulate” are pronounced differently) or by the word-sense disambiguation
algorithms of Chapter 17. As Chapter 4 (page ??) suggests,
automatic  generation  of  prosody  is  a much  harder  problem.  Some  prosodic 
information  can  be  deduced  by  distinguishing  questions  from  non-questions, 
and  by  looking  for  commas  and  periods.  In  general,  however,  it  is  not  easy 
to  extract  the  required  information  from  the  input  text. 

An  alternative  to  the  simplistic  approach  is  to  pass  a richer  representa- 
tion from  the  NLG  system  to  the  speech  synthesizer.  A typical  NLG  system 
knows the semantics and part of speech of the word it intends to generate,
and  can  annotate  the  word  with  this  information  to  help  select  the  proper 
word  pronunciation.  The  system  could  also  annotate  the  output  with  dis- 
course structure  information  to  help  synthesize  the  proper  prosody.  To  date, 
there  has  been  very  little  work  on  this  area  in  NLG. 
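
As a rough sketch of the richer representation, the generator could hand the synthesizer word and part-of-speech pairs instead of a bare word string. The tag names, the lookup table, and the ARPAbet-style transcriptions below are illustrative assumptions, not the interface of any particular synthesizer.

```python
# Illustrative pronunciation table keyed by (word, part of speech).
PRONUNCIATIONS = {
    ("articulate", "ADJ"): "AA0 R T IH1 K Y AH0 L AH0 T",
    ("articulate", "VERB"): "AA0 R T IH1 K Y AH0 L EY2 T",
}


def pronounce(annotated_words):
    """Select a pronunciation per word, using the generator's POS annotation."""
    result = []
    for word, pos in annotated_words:
        # Fall back to the spelling when the table has no entry.
        result.append(PRONUNCIATIONS.get((word.lower(), pos), word.lower()))
    return result


# Example 20.7, annotated as a generator that knows its own output could.
sentence = [
    ("Articulate", "ADJ"), ("people", "NOUN"), ("can", "AUX"),
    ("clearly", "ADV"), ("articulate", "VERB"), ("the", "DET"),
    ("issues", "NOUN"),
]
print(pronounce(sentence))
```

Because the annotation comes from the generator, no part-of-speech tagging of the output string is needed to separate the two pronunciations.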


20.6  Summary 

• Language Generation is the process of constructing natural language outputs
from  non-linguistic  inputs.  As  a field  of  study,  it  usually  does  not  include  the 
study  of  simpler  generation  mechanisms  such  as  canned  text  and  template 
filling. 

• Language  generation  differs  from  language  understanding  in  that  it  fo- 
cuses on  linguistic  choice  rather  than  on  resolving  ambiguity.  Issues 
of  choice  in  generation  include  content  selection,  lexical  selection, 
aggregation,  referring  expression  generation,  and  discourse  struc- 
turing. 

• Language  generation  systems  include  a component  that  plans  the  struc- 
ture of  the  discourse,  called  a discourse  planner,  and  one  that  gener- 
ates single  sentences,  called  a surface  realizer.  Approaches  for  dis- 
course planning  include  text  schemata  and  rhetorical  relation  plan- 
ning. Approaches  for  surface  realization  include  Systemic  Grammar 
and  Functional  Unification  Grammar. 

• Microplanners map the discourse planner output to the surface generator
input, which includes the fine-grained tasks of referring expression
generation, aggregation, and lexical selection.


Bibliographical  and  Historical  Notes 

Excluding canned text and template filling mechanisms, natural language
generation  is  a young  field  relative  to  the  rest  of  language  processing.  Some 
minor  forays  into  the  field  occurred  in  the  50’s  and  60's,  mostly  in  the  con- 
text of  machine  translation,  but  work  focusing  on  generation  didn’t  arise  until 
the  70's.  Simmons  and  Slocum’s  system  (1972)  used  ATN’s  to  generate  dis- 
course from  semantic  networks,  Goldman’s  BABEL  (1975)  used  decision  net- 
works to  perform  lexical  choice,  and  Davey’s  PROTEUS  (1979)  produced  de- 
scriptions of  tic-tac-toe  games.  The  80’s  saw  the  establishment  of  generation 
as  a distinct  field  of  research.  Influential  contributions  on  surface  realization 
were  made  by  McDonald  (1980)  and  the  PENMAN  project  (Mann,  1983),  and 
on  text  planning  by  McKeown  (1985)  and  Appelt  (1985).  The  90's  have  seen 
continuing  interest  with  the  rise  of  generation-focussed  workshops,  both  Eu- 
ropean and  international,  and  organizations  (cf.  the  Special  Interest  Group 
on  language  GENeration,  http://www.aclweb.org/siggen).  Kukich  (1988) 
and  Reiter  and  Dale  (2000)  have  discussed  the  uses  and  limitations  of  canned 
text  and  template  mechanisms. 

As  of  this  writing,  no  textbooks  on  generation  exist.  However,  a text 
on  applied  generation  is  in  press  (Reiter  and  Dale,  2000),  and  a number  of 
survey papers have been written (Dale et al., 1998a; Uszkoreit, 1996; McDonald,
1992; Bateman and Hovy, 1992; McKeown and Swartout, 1988). A
number  of  these  references  discuss  the  history  of  NLG  and  its  relationship  to 
the  rest  of  language  processing.  McDonald  (1992)  introduces  the  distinction 
between  hypothesis  management  and  choice. 

Generation  architectures  have  typically  pipelined  the  tasks  of  planning 
and  realization.  The  pipelining  is  used  to  constrain  the  search  space  within 
each  of  the  modules  and  thus  to  make  the  generation  task  more  tractable  (Re- 
iter and  Dale,  2000;  McDonald,  1988;  Thompson,  1977).  However,  these 
architectures have the well-known problem that decisions made by the discourse
planner cannot easily be undone by the realizer (Meteer, 1992). Appelt’s
KAMP (1985) employed a unified architecture for planning and realization
based on AI planning. This approach, however, has proven computationally
impractical in larger domains. Blackboard architectures have also been
proposed for language generation systems (Nirenburg et al., 1989). The various
concerns of microplanning itself have been the subject of considerable
interest,  including  work  on  referring  expressions  (Dale,  1992;  Appelt,  1985), 
aggregation  (Dalianis,  1999;  Mann  and  Moore,  1981),  and  other  grammat- 
ical issues  (Vander  Linden  and  Martin,  1995;  Meteer,  1992).  The  related 
issues  of  lexical  selection  (Stede,  1998;  Reiter,  1990;  Goldman,  1975)  and 
tailoring  the  output  text  to  particular  audiences  (Paris,  1993;  Hovy,  1988a) 
have  also  received  attention. 

The  late  80's  and  early  90's  saw  the  construction  of  several  reusable 
NLG  systems,  including  two  that  have  been  distributed  publicly:  KPML 
(Bateman,  1997)  and  FUF  (Elhadad,  1993).  These  tools  can  be  downloaded 
through  the  SIGGEN  web  site.  Most  of  this  work  was  done  in  Lisp,  but 
recent  efforts  have  been  made  to  port  the  systems  to  other  languages  and 
platforms. 

Systemic functional linguistics (SFL) was developed by Halliday (1985b).
It has remained largely independent of generative linguistics and is
relatively unknown in the language processing community as a whole.
Attempts to use it in parsing have had limited success (O’Donnell, 1994;
Kasper, 1988). However, it has had a deep impact on NLG, being used in one
form or another by a number of generation systems, including Winograd’s
SHRDLU (1972b), Davey’s PROTEUS, Patten’s SLANG (1988), PENMAN (Mann,
1983), FUF (Elhadad, 1993) and ILEX (Dale et al., 1998b). The example
systemic grammar in this chapter is based in part on Winograd’s discussion
(1972b). SFL’s most complete computational implementation is the
Komet-Penman MultiLingual development environment (KPML), which is a
descendant of PENMAN. KPML is packaged with NIGEL, a large English
generation grammar, as well as an environment for developing multilingual
grammars. It also includes a Sentence Planning Language (SPL) that forms a
more usable interface to the systemic grammar itself. SPL specifications
are considerably simpler to build than specifications that must include all
the information required to make all the choices in the system network, but
are more flexible than the spl-constructor example given in the chapter.
Consider the following SPL specification:

    (s1 / save
        :actor (a1 / system
                   :determiner the)
        :actee (a2 / document
                   :determiner the)
        :tense future
    )

The SPL interpreter will expand this into the series of feature choices
required for the Nigel grammar to generate example 20.2 (“The system will
save the document.”). Each term in this specification gives the role of the
entity (e.g., actor, actee) as well as the semantic type (e.g., save,
system, document). The semantic types are KL-ONE-styled concepts
subordinated to a general ontology (cf. Chapter 16) of concepts called the
upper model (Bateman et al., 1990). This ontology, which represents
semantic distinctions that have grammatical consequences, is used by SPL to
determine the type of entity being expressed and thus to reduce the amount
of information explicitly contained in the SPL specification. This example
leaves out the :speechact assertion term included in the example in the
chapter because SPL uses this as a default value if left unspecified.

Functional  Unification  Grammar  was  developed  by  Kay  (1979),  see 
Chapter  11.  Its  most  influential  implementation  for  generation  is  the  Func- 
tional Unification  Formalism  (FUF)  developed  by  Elhadad  (Elhadad,  1993, 
1992).  It  is  distributed  with  the  English  grammar  SURGE.  Although  the 
example  given  in  the  chapter  used  a simple  phrase-structure  approach  to 
grammatical  categorization  (cf.  (Elhadad,  1992)),  the  SURGE  grammar  uses 
systemic  categorizations. 

Another  linguistic  theory  that  has  been  influential  in  language  gener- 
ation is  Mel’cuk’s  Meaning  Text  Theory  (MTT)  (1988).  MTT  postulates  a 
number  of  levels  ranging  from  deep  syntax  all  the  way  to  surface  structure. 
Surface  realizers  that  use  it,  including  CoGenTex’s  REALPRO  (Lavoie  and 
Rambow,  1997)  and  ERLI's  AlethGen  (Coch,  1996b),  start  with  the  deep 
levels  and  map  from  level  to  level  until  they  reach  the  surface  level. 

Discourse  generation  has  been  a concern  of  NLG  from  the  beginning. 
Davey’s  PROTEUS,  for  example,  produced  paragraph-length  summaries  of 
tic-tac-toe  games.  His  system  structured  its  output  based  heavily  upon  the 
structure  of  the  trace  of  the  game  which  the  application  system  recorded. 
Schema-based  text  structuring,  pioneered  by  McKeown  (1985),  is  more  flex- 
ible and  has  been  used  in  a number  of  applications  (Milosavljevic,  1997; 
Paris,  1993;  McCoy,  1985).  The  schema-based  example  presented  in  this 
chapter  is  based  on  the  COMET  instruction  generation  system  (McKeown 
et al., 1990). Although other theories of discourse structure (cf. Chapter
18) have influenced NLG, including theories by Grosz and Sidner (1986),
Hobbs (1979a), and Kamp’s DRT (1981), Rhetorical Structure Theory (RST),
developed by Mann and Thompson (1987b), has had the most influence (Marcu,
1998; Scott and Souza, 1990; Hovy, 1988b). The classic automobile example
in this chapter is adapted from Mann and Thompson (1986), and the RST-based
planning example is based on Moore and Paris’ text planner (Moore and
Paris, 1993) as it was used in the DRAFTER (Paris and Vander Linden, 1996;
Paris et al., 1995), ISOLDE (Paris et al., 1998) and WIP (Wahlster et al.,
1993) projects. The use of this planner in the context of an interactive
dialog system is described by Moore and Paris (1993). A more recent
alternative to this approach has been developed by Marcu (1998).

Applications of NLG tend to focus on relatively restricted sublanguages
(cf. Chapter 21), including weather reports (Coch, 1998; Goldberg et al.,
1994), instructions (Paris et al., 1998; Paris and Vander Linden, 1996;
Wahlster et al., 1993), encyclopedia-like descriptions (Milosavljevic,
1997; Dale et al., 1998b), and letters (Reiter et al., 1999). The output
can be delivered as simple text or hypertext (Lavoie et al., 1997; Paris
and Vander Linden, 1996), dynamically generated hypertext (Dale et al.,
1998b), multimedia presentation (Wahlster et al., 1993), and speech (Van
Deemter and Odijk, 1997). Information on a number of these systems is
available on-line at the SIGGEN web site.

The  evaluation  of  NLG  systems  has  received  much  recent  attention. 
Evaluations  have  assessed  the  similarity  of  the  output  with  a representative 
corpus  (Yeh  and  Mellish,  1997;  Vander  Linden  and  Martin,  1995),  convened 
panels  of  experts  to  review  the  text  (Lester  and  Porter,  1997;  Coch,  1996a), 
and  tested  how  effective  the  text  was  at  achieving  its  communicative  purpose 
(Reiter et al., 1999). It is also becoming more common for the usability of
the  NLG  system  itself  to  be  evaluated. 

Other  issues  of  interest  in  NLG  include  the  use  of  connectionist  and 
statistical  techniques  (Langkilde  and  Knight,  1998;  Ward,  1994),  and  the 
viability  of  multilingual  generation  as  an  alternative  to  machine  translation 
(Hartley and Paris, 1997; Goldberg et al., 1994).


Exercises 

20.1  Use the systemic grammar given in the chapter to build a
multiple-layer analysis of the following sentences:

a. The document will be saved by the system.

b.  Will  the  document  be  saved  by  the  system? 

c.  Save  the  document. 

20.2  Extend  the  systemic  grammar  given  in  the  chapter  to  handle  the  fol- 
lowing sentences: 

a. The document is large. (a “relational process”)

b.  Give  the  document  to  Mary. 

c.  Is  the  document  saved?  (a  “polar  interrogative”) 

20.3  Use  the  FUF  grammar  given  in  the  chapter  to  build  a fully  unified  FD 
for  the  following  sentences: 

a.  The  system  saves  the  document. 

b.  The  systems  save  the  document. 

c.  The  system  saves  the  documents. 

20.4  Extend  the  FUF  grammar  given  in  the  chapter  to  handle  the  following 
sentences: 

a. The document will be saved by the system. (i.e., the passive)

b. Will the document be saved by the system? (i.e., wh- questions)

c. Save the document. (i.e., imperative commands)

20.5  Select a restricted sublanguage (cf. Chapter 21) and build either a
systemic or FUF generation grammar for it. The sublanguage should be a
subset of a restricted domain such as weather reports, instructions, or
responses to simple inquiries. As a test, you can download either FUF or
KPML, whichever is appropriate, and implement your grammar. Both systems
can be found through the SIGGEN web site. (Note that it is much easier to
build test grammars with FUF than with KPML.)

20.6  Compare and contrast the SPL input to KPML (discussed in the
bibliographical and historical notes) and the FD input to FUF. What
decisions are required of the discourse planner for each of them? What are
their relative strengths and weaknesses?

20.7  (Adapted from McKeown (1985)) Build an ATN appropriate for
structuring a typical encyclopedia entry. Would it be in any way different
from an ATN for a dictionary entry, and if so, could you adapt the same ATN
for both purposes?

20.8  (Adapted from Bateman (1997)) Build a system network for using
“dr”, “mr”, “ms”, “mrs”, “miss” in expressions like “Miss Jones” and “Mr.
Smith”. What information would the knowledge base need to contain to
make the appropriate choices in your network?


20.9  Do  an  RST  analysis  for  the  following  text: 

Temperature  Adjustment 

Before  you  begin,  be  sure  that  you  have  administrator  access 
to  the  system.  If  you  do,  you  can  perform  the  following  steps: 

a.  From  the  EMPLOYEE  menu  select  the  Adjust  Temperature 
item.  The  system  displays  the  Adjust  Temperature  dialog 
box. 

b.  Select  the  room.  You  may  either  type  the  room  number  or 
click  on  the  appropriate  room's  icon. 

c.  Set  the  temperature.  In  general  you  shouldn’t  change  the 
temperature  too  drastically. 

d.  Click  the  ok  button.  The  system  sets  the  room  temperature. 

By entering a desired temperature, you are pretending that
you just adjusted the thermostat of the room that you are in.

The  chapter  lists  a subset  of  the  RST  relations.  Does  it  give  you  all  the 
relations  you  need?  How  do  you  think  your  analysis  would  compare  with 
the  analyses  produced  by  other  analysts? 

20.10  How  does  RST  compare  with  Grosz  and  Sidner’s  theory  of  discourse 
presented  in  Chapter  18?  Does  one  encompass  the  other  or  do  they  address 
different  issues?  Why  do  you  think  that  RST  has  had  a greater  influence  on 
NLG? 

20.11  Would RST be useful for interactive dialog? If so, how would you
use it? If not, what changes would you make to get it to work?

20.12  (Adapted from ISOLDE (Paris et al., 1998)) Speculate on how you
would enhance an RST-based discourse planner to plan multi-modal
discourse, which would include diagrams and formatting (such as html
formatting).

20.13  (Adapted from STOP (Reiter et al., 1999)). This chapter did not
discuss template generators in any detail; it simply mentioned that they
are easy to implement but inflexible. Try writing a simple template
generator that produces persuasive letters that try to convince their
recipients to stop smoking. The letter should include the standard elements
of a letter as well as a discussion of the dangers of smoking and the
advantages of quitting. For ideas, you can visit the STOP web site,
available through the SIGGEN web site.

How flexible can you make the mechanism within the confines of template
generation? Can you extend the system to take a case file on a particular
patient that contains their medical history and produce a customized
letter?

20.14  (Adapted from PEBA (Milosavljevic, 1997)). In the manner discussed
in exercise 20.13, write a template generator that produces
encyclopedia-like descriptions of animals. For ideas, you can visit the
PEBA II web site, available through the SIGGEN web site.


Chapter 21.  Machine Translation

. . . Translation is a fine and exacting art, but there is much about
it that is mechanical and routine.  Kay (1997)


This chapter introduces techniques for machine translation (MT), the
use of computers to automate some or all of the process of translating from
one language to another. Translation, in its full generality, is a
difficult, fascinating, and intensely human endeavor, as rich as any other
area of human creativity. Consider the following passage from the end of
Chapter 45 of the 18th-century novel The Story of the Stone, also called
Dream of the Red Chamber, by Cao Xue Qin (Cao, 1973), with the Chinese
original transcribed in the Mandarin dialect, and the English translation
by David Hawkes:

As  she  lay  there  alone,  Dai-yu’s  thoughts  turned  to  Bao-chai. . . Then  she  lis- 
tened to  the  insistent  rustle  of  the  rain  on  the  bamboos  and  plantains  outside 
her  window.  The  coldness  penetrated  the  curtains  of  her  bed.  Almost  without 
noticing  it  she  had  begun  to  cry. 

dai  yu  zi  zai  chuang  shang  gan  nian  bao  chai. . . 

Dai-yu  alone  on  bed  top  think-of-with-gratitude  Bao-chai 

you  ting  jian  chuang  wai  zhu  shao  xiang  ye  zhe 

again  listen  to  window  outside  bamboo  tip  plantain  leaf  of 

shang,  yu  sheng  xi  li,  qing  han  tou  mu, 

on-top,  rain  sound  sigh  drip,  clear  cold  penetrate  curtain, 

bu  jue  you  di  xia  lei  lai. 

not  feeling  again  fall  down  tears  come. 

Consider some of the issues involved in this kind of literary
translation. First, there is the problem of how to translate the Chinese
names, complicated by Cao’s frequent use of names involving wordplay.
Hawkes chose to use transliterations for the names of the main characters
but to translate names of servants by their meanings (Aroma, Skybright).
Chinese rarely marks verbal aspect or tense; Hawkes thus had to decide to
translate Chinese tou as penetrated, rather than, say, was penetrating or
had penetrated. Hawkes also chose the possessive pronoun her to make her
window more appropriate for the mood of a quiet bedroom scene than the
window. To make the image clear for English readers unfamiliar with Chinese
bed-curtains, Hawkes translated mu (‘curtain’) as curtains of her bed.
Finally, the phrase bamboo tip plantain leaf, although elegant in Chinese,
where such four-character phrases are a hallmark of literate prose, would
be awkward if translated word-for-word into English, and so Hawkes used
simply bamboos and plantains.

Translation  of  this  sort  clearly  requires  a deep  and  rich  understanding 
of  the  source  language  and  the  input  text,  and  a sophisticated,  poetic,  and 
creative  command  of  the  target  language.  The  problem  of  automatically 
producing  a high-quality  translation  of  an  arbitrary  text  from  one  language 
to  another  is  thus  far  too  hard  to  automate  completely.  But  certain  simpler 
translation  tasks  can  be  addressed  with  current  computational  models.  In 
particular, machine translation systems often focus on (1) tasks for which
a rough  translation  is  adequate,  (2)  tasks  where  a human  post-editor  can 
be  used  to  improve  MT  output,  and  (3)  tasks  limited  to  small  sublanguage 
domains  in  which  fully  automatic  high  quality  translation  is  achievable. 

Information  acquisition  on  the  Web  is  the  kind  of  ‘information  pull’ 
task  where  readers  may  be  willing  to  settle  for  a very  rough  translation.  Con- 
sider these  extracts  from  a French  web  page  and  a machine  translation: 

Nous  sommes  une  association  type  Loi  de  1901,  et  notre  raison  d’etre 
est  de  practiquer,  de  promouvoir,  de  faire  decouvrir  le  Paintball,  et  le 
cas  echeant  de  supporter  nos  equipes  de  competition:  ...  Si  vous  avez 
des  questions,  des  envies  d’organisation  de  parties,  des  envies  de  jouer 
tout  courte  et  des  envies  de  decouvrir,  n’hesitez  pas  a nous  contacter 
par  courrier  ou  par  telephone  ou  bien  encore  par  eMail.  . . . Au  sortir  de 
la  saison  97/98  et  surtout  au  debut  de  cette  saison  98/99,  les  effectifs 
des  HORS-TAXE  sont  modifies. 

We are a standard association Loi of 1901, and our raison d’&ecirc;tre is
to practice, promote, make discover Paintball, and to support our teams of
competition if necessary: ... If you have questions, desires of
organization of parts, desires for playing very short and desires for
discovering, do not hesitate to contact us by mail or telephone or even by
eMail. . . . With leaving season 97/98 and especially at the beginning of
this season 98/99, manpower of the HORS-TAXE are modified!

This is good enough to figure out that we have found the home page of a
paintball team, and one that seems friendly and perhaps willing to accept
new members. Armed with this information, we can then try to find someone
to properly translate it for us, or perhaps just go ahead and send e-mail
to the organizer to ask if we can play. Incidentally, the use of MT for
such document-finding purposes can sometimes be avoided or made more
efficient by using cross-language information retrieval techniques, which
focus on the retrieval of documents in a language other than that used for
the query terms (Oard, 1997).

Rough translation is also useful as the first stage in a complete
translation process. An MT system can produce a draft translation that can
be fixed up in a post-editing process by a human translator. Even a rough
draft can sometimes speed up the overall translation process. Strictly
speaking, systems used in this way are doing computer-aided human
translation (CAHT or CAT) rather than (fully automatic) machine
translation. This model of MT usage is effective especially for high volume
jobs and those requiring quick turn-around. The most familiar example is
perhaps the translation of software manuals for localization to reach new
markets. Another effective application is the translation of market-moving
financial news, for example from Japanese to English for use by stock
traders.

Weather forecasting is an example of a sublanguage domain that can be
modeled completely enough to use raw MT output even without post-editing.
Weather forecasts consist of phrases like “Cloudy with a chance of showers
today and Thursday.”, “Low tonight 4, high Thursday 10.”, and “Outlook for
Friday: sunny.” This domain has a limited vocabulary and only a few basic
phrase types. Ambiguity is rare, and the senses of ambiguous words are
distinct and easily disambiguated based on local context, using word
classes and semantic features such as MONTH, PLACE, DIRECTION, TIME POINT,
TIME DURATION, DEGREE-OF-POSSIBILITY. Other domains that are
sublanguage-like include equipment maintenance manuals, air travel queries,
appointment scheduling, and restaurant recommendations.
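
As a rough illustration of why such a sublanguage is tractable, the
following Python sketch translates the three forecast phrases above using
nothing more than a few phrase patterns and two word-class lexicons. The
French renderings, the word classes DAYS and SKY, and the function name
translate_forecast are assumptions made up for this example; they are not
taken from any deployed weather MT system.

```python
import re

# Word-class lexicons (illustrative entries only, not a real system's data).
DAYS = {"today": "aujourd'hui", "tonight": "ce soir",
        "thursday": "jeudi", "friday": "vendredi"}
SKY = {"cloudy": "nuageux", "sunny": "ensoleillé"}

# Regular-expression fragments that match any member of a word class.
DAY = "(" + "|".join(DAYS) + ")"
SKY_RE = "(" + "|".join(SKY) + ")"


def _day(word):
    return DAYS[word.lower()]


def _sky(word):
    return SKY[word.lower()]


# Each basic phrase type pairs an English pattern with a French renderer.
PHRASE_TYPES = [
    (re.compile(rf"{SKY_RE} with a chance of showers {DAY} and {DAY}\.", re.I),
     lambda m: f"{_sky(m[1]).capitalize()} avec possibilité d'averses "
               f"{_day(m[2])} et {_day(m[3])}."),
    (re.compile(rf"Low {DAY} (-?\d+), high {DAY} (-?\d+)\.", re.I),
     lambda m: f"Minimum {_day(m[1])} {m[2]}, maximum {_day(m[3])} {m[4]}."),
    (re.compile(rf"Outlook for {DAY}: {SKY_RE}\.", re.I),
     lambda m: f"Aperçu pour {_day(m[1])} : {_sky(m[2])}."),
]


def translate_forecast(sentence):
    """Return a French rendering, or None if no phrase type matches."""
    for pattern, render in PHRASE_TYPES:
        match = pattern.fullmatch(sentence.strip())
        if match:
            return render(match)
    return None
```

A sentence outside the covered phrase types simply returns None, which is
the point at which a real system would need either a richer grammar or a
human post-editor.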

This chapter breaks with the pattern of previous chapters in that the
focus is less on introducing new techniques than on showing how the
techniques presented earlier are used in practice. One of the themes of
this chapter is that there are often trade-offs and difficult choices among
alternative approaches and techniques.

Section 21.1 gives some simple illustrations of the ways in which
languages differ. The following four sections are organized around four
basic models for doing MT: Section 21.2 introduces the use of syntactic
transformations for overcoming differences in grammar, as well as some
techniques for
choosing  target  language  words.  Section  21.3  introduces  some  ways  of  ex- 
ploiting meaning  during  translation,  in  particular  the  use  of  thematic  roles 
and  primitive  decomposition.  Section  21.4  presents  the  minimalist  ‘direct’ 
approach.  Section  21.5  discusses  the  use  of  statistical  techniques  to  improve 
various  aspects  of  MT.  Finally,  Section  21.6  discusses  reasons  for  the  gap 
between  expectations  and  performance,  and  discusses  strategies  for  meeting 
users’  needs  despite  finite  development  resources. 


21.1  Language  Similarities  and  Differences

When you accidentally pick up a radio program in some foreign language it
seems like chaos, completely unlike the familiar languages of your everyday
life. But there are patterns in this chaos, and indeed, some aspects of
human language seem to be universal, holding true for every language. Many
universals arise from the functional role of language as a communicative
system used by humans. Every language, for example, seems to have words for
referring to people, for talking about women, men, and children, eating and
drinking, for being polite or not. Other universals are more subtle; for
example, Chapter 8 mentioned that every language seems to have nouns and
verbs.

Even when languages differ, these differences often have systematic
structure. The study of systematic cross-linguistic similarities and
differences is called typology (Croft, 1990; Comrie, 1989). This section
sketches some typological facts about crosslinguistic similarity and
difference. This bears on our main topic, MT, in that the difficulty of
translating from one language to another depends a great deal on how
similar the languages are in their vocabulary, grammar, and conceptual
structure.

Morphologically, languages are often characterized along two dimensions
of variation. The first is the number of morphemes per word, ranging from
isolating languages like Vietnamese and Cantonese, in which each word
generally has one morpheme, to polysynthetic languages like Siberian Yupik
(Eskimo), in which a single word may have very many morphemes,
corresponding to a whole sentence in English. The second dimension is the
degree to which morphemes are segmentable, ranging from agglutinative
languages like Turkish (discussed in Chapter 3), in which morphemes have
relatively clean boundaries, to fusion languages like Russian, in which a
single affix may conflate multiple morphemes, like -om in the word stolom
(table-SG-INSTR-DECL1), which fuses the distinct morphological categories
instrumental, singular, and first declension.

Syntactically, languages are perhaps most saliently different in the
basic word order of verbs, subjects, and objects in simple declarative
clauses. German, French, English, and Mandarin, for example, are all SVO
languages, meaning that the verb tends to come between the subject and
object. Hindi and Japanese, by contrast, are SOV languages, meaning that
the verb tends to come at the end of basic clauses, while Irish, Classical
Arabic, and Biblical Hebrew are VSO languages. Two languages that share
their basic word-order type often have other similarities. For example, SVO
languages generally have prepositions while SOV languages generally have
postpositions; English has to Yuriko where Japanese has Yuriko ni.
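
The link between basic word order and adposition placement can be made
concrete with a small Python sketch. The Clause type and the functions
linearize_clause and linearize_adposition are hypothetical names introduced
only for this illustration. The sketch ignores everything about real syntax
except constituent order, and its treatment of VSO languages as
prepositional is an assumption of the example rather than a claim made in
the text above.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    subject: str
    verb: str
    obj: str


# Constituent order for each basic word-order type.
ORDERS = {
    "SVO": ("subject", "verb", "obj"),
    "SOV": ("subject", "obj", "verb"),
    "VSO": ("verb", "subject", "obj"),
}


def linearize_clause(clause, order):
    """Arrange subject, verb, and object according to the word-order type."""
    return " ".join(getattr(clause, role) for role in ORDERS[order])


def linearize_adposition(adposition, noun, order):
    """Place the adposition after the noun for SOV, before it otherwise.

    This encodes a tendency, not a rule: SVO languages generally have
    prepositions and SOV languages generally have postpositions.
    """
    if order == "SOV":
        return f"{noun} {adposition}"
    return f"{adposition} {noun}"
```

For instance, linearize_adposition("to", "Yuriko", "SVO") gives the English
pattern to Yuriko, while linearize_adposition("ni", "Yuriko", "SOV") gives
the Japanese pattern Yuriko ni.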

Another  important  syntactico-morphological  distinction  is  between 
head-marking  and  dependent-marking  languages  (Nichols,  1986).  Head- 
marking languages  tend  to  mark  the  relation  between  the  head  and  its  depen- 
dents on  the  head.  Dependent-marking  languages  tend  to  mark  the  relation 
on the non-head. Nichols (1986), for example, notes that Hungarian marks
the possessive relation with an affix (A) on the head noun (H), where
English marks it on the (non-head) possessor:

(21.1)  English    the man-(A)'s (H)house
        Hungarian  az ember (H)ház-(A)a
                   the man house-his

This syntactic distinction is related to a semantic distinction in how
languages map conceptual notions onto words. Talmy (1985, 1991) noted that
languages can be characterized by whether direction of motion and manner of
motion are marked on the verb or on the ‘satellites’: particles,
prepositional phrases, or adverbial phrases. For example, a bottle floating
out of a cave would be described in English with the direction marked on
the particle out as:

(21.2)  The bottle floated out.

but in Spanish with the direction marked on the verb as:

(21.3)  La botella salió flotando.
        The bottle exited floating.

Languages that mark the direction of motion on the verb (leaving the
satellites to mark the manner of motion) Talmy called verb-framed; Slobin
(1996) gives examples like Spanish acercarse ‘approach’, alcanzar ‘reach’,
entrar ‘enter’, salir ‘exit’. Languages that mark the direction of motion
on the satellite (leaving the verb to mark the manner of motion) Talmy
called satellite-framed; Slobin (1996) gives examples like English crawl
out, float off, jump down, walk over to, run after. Talmy (1991) noted that
verb-framed languages include Romance, Semitic, Japanese, Tamil,
Polynesian, most Bantu, most Mayan, Nez Perce, and Caddo, while
satellite-framed languages include most Indo-European minus Romance,
Finno-Ugric, Chinese, Ojibwa, and Warlpiri.

In addition to such properties that systematically vary across large
classes of languages, there are many specific characteristics, more or less
unique to single languages. English, for example, has an idiosyncratic
syntactic construction involving the word there that is often used to
introduce a new scene in a story, as in there burst into the room three men
with guns.

To give an idea of how trivial, yet crucial, these differences can be,
think of dates. Dates not only appear in various formats (typically YYMMDD
in Japanese, MM-DD-YY in American English, and DD/MM/YY in British
English); the calendars themselves may also differ. For example, dates in
Japanese are often given relative to the start of the current Emperor’s
reign rather than to the start of the Christian Era.
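
The following Python sketch renders one date in each of the formats just
mentioned and also computes a reign-relative year. The function names and
the ERAS table are assumptions introduced for this illustration; the table
lists only two eras (Heisei, from 8 January 1989, and Reiwa, from 1 May
2019) and so is far from a complete calendar conversion.

```python
from datetime import date

# Era names with their Gregorian start dates, newest first.
# Only two eras are listed; this is a sketch, not a full calendar.
ERAS = [
    ("Reiwa", date(2019, 5, 1)),
    ("Heisei", date(1989, 1, 8)),
]


def american(d):
    """MM-DD-YY, as in American English."""
    return d.strftime("%m-%d-%y")


def british(d):
    """DD/MM/YY, as in British English."""
    return d.strftime("%d/%m/%y")


def japanese_numeric(d):
    """YYMMDD, the numeric format the text attributes to Japanese."""
    return d.strftime("%y%m%d")


def japanese_era_year(d):
    """Year counted from the start of the reign that contains the date."""
    for name, start in ERAS:
        if d >= start:
            # The first (partial) calendar year of an era counts as year 1.
            return f"{name} {d.year - start.year + 1}"
    raise ValueError("date precedes the eras listed in ERAS")
```

Applied to 4 March 1999, the four functions give 03-04-99, 04/03/99,
990304, and Heisei 11: the same day written four ways, each of which a
translation system has to recognize on input and produce on output.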

Turning  now  to  the  question  of  lexical  organization,  here  too  there 
are  interesting  patterns.  Many  words  can  be  translated  relatively  directly 
into  other  languages.  English  dog,  for  example,  translates  to  Mandarin  gou. 
Where  English  has  chocolate,  Italian  has  cioccolato  and  Japanese  has  choko- 
reeto. 1 

Sometimes,  rather  than  a single  word,  there  is  a fixed  phrase  in  the 
target  language;  French  informatique  thus  translates  to  English  computer 
science.  In  more  difficult  cases,  however,  a word  in  one  language  does  not 
map  so  simply  to  a word  or  phrase  in  another  language. 

Grammatically, for example, a word may translate best to a word of
another part of speech in the target language. Many English sentences
involving the verb like must be translated into German using the adverbial
gern; thus she likes to sing maps to sie singt gerne, where the syntactic
structure is also affected.


1 Although chokoreeto in Japanese is perforce more formal than English
chocolate, since Japanese also has the informal short form choko.

Sometimes one language places more grammatical constraints on word
choice than another. English, for example, distinguishes gender in pronouns
where Mandarin does not; thus translating a third-person singular pronoun
from Mandarin to English requires deciding whether the original referent
was masculine or feminine. The same is true when translating from the
English plural pronoun they, unspecified for gender, into French (masculine
ils, feminine elles). In Japanese, there is no single word for is; speakers
must choose between iru or aru, based on whether the subject is animate2 or
not.

Such  differences  in  specificity  also  occur  on  the  semantic  side:  one  lan- 
guage may  divide  up  a particular  conceptual  domain  in  more  detail  than  an- 
other. English,  for  example,  has  a particularly  impoverished  kinship  vocab- 
ulary; the  single  word  brother  can  indicate  either  a younger  or  older  brother. 
Japanese  and  Chinese,  by  contrast,  both  distinguish  seniority  in  sibling  rela- 
tions. Figure  21.1  gives  some  further  examples. 


Source word       Target language   Target words
----------------  ----------------  ------------------------------------
English brother   Japanese          otooto (younger), oniisan (older)
                  Mandarin          didi (younger), gege (older)
English wall      German            Wand (inside), Mauer (outside)
English know      French            connaître (be acquainted with),
                                    savoir (know a proposition)
English they      French            ils (masculine), elles (feminine)
German Berg       English           hill, mountain
Mandarin tā       English           he, she, or it

Figure 21.1   Differences in specificity.

The  way  that  languages  differ  in  lexically  dividing  up  conceptual  space 
may  be  more  complex  than  this  one-to-many  translation  problem,  leading  to 
many-to-many  mappings.  For  example  Figure  21.2  summarizes  some  of  the 
complexities  discussed  by  Hutchins  and  Somers  (1992)  in  relating  English 
leg,  foot,  and  paw,  to  the  French  jambe,  pied,  patte,  etc. 

Further, one language may have a lexical gap, where no word or phrase,
short  of  an  explanatory  footnote,  can  express  the  meaning  of  a word  in  the 


2 Taxis  and  buses  in  service  sometimes  count  as  animate  for  this  purpose. 




Chapter  21.  Machine  Translation 


other  language.  For  example,  Japanese  does  not  have  a word  for  privacy, 
and  English  does  not  have  a word  for  Japanese  oyakoko  (we  make  do  with 
filial  piety). 

Moreover,  dependencies  on  cultural  context,  as  manifest  in  the  back- 
ground and  expectations  of  the  readers  of  the  original  and  translation,  further 
complicate  matters.  A number  of  translation  theorists  (Steiner,  1975;  Barn- 
stone,  1993;  Hofstadter,  1997)  refer  to  a clever  story  by  Jorge  Luis  Borges 
showing  that  even  two  linguistic  texts  with  the  same  words  and  grammar  may 
have  different  meanings  because  of  their  different  cultural  contexts.  Borges 
invents  Menard,  a French  author  in  the  1930's  whose  aim  was  to  recreate 
Cervantes'  Don  Quixote  word  for  word: 

The  text  of  Cervantes  and  that  of  Menard  are  verbally  identical,  but  the  sec- 
ond is  almost  infinitely  richer.  (More  ambiguous,  his  detractors  will  say;  but 
ambiguity  is  a richness.)  It  is  a revelation  to  compare  the  Don  Quijote  of 
Menard  with  that  of  Cervantes.  The  latter,  for  instance,  wrote: 

. . . la  verdad,  cuya  madre  es  la  historia,  emula  del  tiempo,  deposito  de 
las  acciones,  testigo  de  lo  pasado,  ejemplo  y aviso  de  lo  presente,  ad- 
vertencia  de  lo  por  venir. 

Menard, on the other hand, writes:

. . . la verdad, cuya madre es la historia, emula del tiempo, deposito de
las acciones, testigo de lo pasado, ejemplo y aviso de lo presente,
advertencia de lo por venir.

Equally vivid is the contrast in styles. The archaic style of Menard - in the
last analysis, a foreigner - suffers from a certain affectation. Not so that of
his precursor, who handles easily the ordinary Spanish of his time.

These last points suggest a more general question about cultural
differences and the possibility (or impossibility) of translation. A theoretical
position sometimes known as the Sapir-Whorf hypothesis suggests that
language may constrain thought, that is, that the language you speak may
affect the way you think. To the extent that this hypothesis is true, there can
be no perfect translation, since speakers of the source and target languages
necessarily have different conceptual systems. In any case it is clear that the
differences between languages run deep, and that the process of translation
is not going to be simple.


21.2  The  Transfer  Metaphor 

As the previous section illustrated, languages differ. One strategy for doing
MT is to translate by a process of overcoming these differences, altering the
structure of the input to make it conform to the rules of the target language.

This can be done by applying contrastive knowledge, that is, knowledge
about the differences between the two languages. Systems that use this
strategy are sometimes said to be based on the transfer model.

Since  this  requires  some  representation  of  the  structure  of  the  input, 
transfer  presupposes  a parse  of  some  form.  Moreover,  since  transfer  only 
results  in  a structure  for  the  target  language,  it  must  be  followed  by  a gener- 
ation phase  to  actually  create  the  output  sentence.  Thus,  on  this  model,  MT 
involves  three  phases:  analysis,  transfer,  and  generation,  where  transfer 
bridges  the  gap  between  the  output  of  the  source  language  parser  and  the 
input  to  the  target  language  generator.  Figure  21.3  shows  a sketch  of  this 
transfer  architecture. 

It is worth noting that a parse for MT may differ from parses required
for other purposes. For example, suppose we need to translate John saw the
girl with the binoculars into French. The parser does not need to bother to
figure out where the prepositional phrase attaches, because both possibilities
lead to the same French sentence. However, this is not true for all
prepositional phrase attachments, and so an MT system also needs to be able
to represent disambiguated parses, while still being able to work with
ambiguous ones (Emele and Dorna, 1998).

[Diagram: source language words → (parsing) → source language parse tree
→ (transfer) → target language parse tree → (generation) → target language
words]

Figure 21.3 The transfer architecture for Machine Translation.

Syntactic  Transformations 

Let  us  begin  by  considering  syntactic  differences.  The  previous  section  noted 
that  in  English  the  unmarked  order  in  a noun-phrase  had  adjectives  precede 
nouns,  but  in  French  adjectives  follow  nouns.3  Temporarily  postponing  the 
question  of  how  to  translate  the  words,  let’s  consider  how  an  MT  system  can 
overcome  such  differences. 


noun phrase            ⇒      noun phrase
adjective noun                noun adjective


Figure  21.4  A simple  transformation  that  reorders  adjectives  and  nouns 


Figure 21.4 suggests the basic idea. Here we transform one parse tree,
suitable for describing an English phrase, into another parse tree, suitable
for describing a French phrase. In general, syntactic transformations are
operations that map from one tree structure to another.

Now let’s illustrate roughly how such transformations can restructure
an entire sentence, using a simplified sentence:

(21.4)  There  was  an  old  man  gardening. 

3 There are exceptions to this generalization, such as galore in English and
gros in French; furthermore in French some adjectives can appear before the
noun with a different meaning: route mauvaise ‘bad road, badly-paved road’
versus mauvaise route ‘wrong road’ (Waugh, 1976).

We will assume that the parser has given us a structure like the following.
We will also assume that the system starts performing transformations at the
top node of the tree and works its way down:
Existential-There-Sentence 

there  was  an  old  man  gardening 

Since this sentence involves an “existential there construction”, which
has no analog in Japanese, we immediately have to apply a transformation
that deletes the sentence-initial there and converts the fourth constituent to
a relative clause modifying the noun, producing something like the following
structure:


Intermediate-Representation

an old man gardening was

The resulting structure is thus something more like the structure of a
pseudo-English sentence: an old man, who was gardening, was.

Next, another transformation applies to reverse the order of the noun
phrase and the relative clause, giving something like the following structure:

Intermediate-Representation-2

gardening an old man was


At  this  point  all  relevant  transformations  have  applied,  and  lexical 
transfer  takes  place,  substituting  Japanese  words  for  the  English  ones,  as 
discussed  in  the  next  section.  This  gives  the  final  structure  below: 


Japanese-S 


niwa  no  teire  o suru  ojiisan  ita 

After  this,  a little  more  syntactic  work  is  required  to  produce  an  actual 
Japanese  sentence,  including:  1.  adding  the  word  ga,  which  is  required  in 
Japanese  to  mark  the  subject,  2.  choosing  the  verb  that  agrees  with  the  sub- 
ject in  terms  of  animacy,  namely  iru,  not  aru,  and  3.  inflecting  the  verbs.  The 
final  generation  step  traverses  or  otherwise  linearizes  the  tree  to  produce  a 
string  of  words.  Although  these  generation  tasks  can  be  done  by  the  tech- 
niques of  Chapter  20,  practical  systems  usually  do  them  directly  with  simple 
procedures. In any case, the final output will be:

niwa no teire o shite ita ojiisan ga ita.

garden  GEN  upkeep  OBJ  do  PAST-PROG  old  man  SUBJ  was 

Figure 21.5 shows a rough representation of the transformations we
have  discussed.  Such  transformations  can  be  implemented  as  pattern-rewrite 
rules:  if  the  input  matches  the  left  side  of  a transformation,  it  is  rewritten 
according  to  the  right  side. 


English to French:

1.  NP → Adjective1 Noun2
    ⇒  NP → Noun2 Adjective1

English to Japanese:

2.  Existential-There-Sentence → There1 Verb2 NP3 Postnominal4
    ⇒  Sentence → (NP → NP3 Relative-Clause4) Verb2

3.  NP → NP1 Relative-Clause2
    ⇒  NP → Relative-Clause2 NP1

Figure 21.5 An informal description of some transformations.
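
As a rough illustration, the three rules of Figure 21.5 can be sketched as Python functions over trees. The tree encoding (a label paired with a list of children, with plain strings as words), the function names, and the node labels Det, There, and Verb are assumptions made for this sketch; they are not taken from any particular MT system.

```python
# A tree is (label, [children]); a word is a plain string.

def child_labels(node):
    """Labels of the children; a plain word counts as None."""
    return [kid[0] if isinstance(kid, tuple) else None for kid in node[1]]


def reorder_adjective(node):
    """Rule 1: NP -> Adjective Noun  =>  NP -> Noun Adjective."""
    label, kids = node
    if label == "NP" and child_labels(node) == ["Adjective", "Noun"]:
        return (label, [kids[1], kids[0]])
    return node


def existential_there(node):
    """Rule 2: There Verb NP Postnominal => (NP -> NP Relative-Clause) Verb."""
    label, kids = node
    if label == "Existential-There-Sentence" and len(kids) == 4:
        _there, verb, np, postnominal = kids
        relative = ("Relative-Clause", postnominal[1])
        return ("Sentence", [("NP", [np, relative]), verb])
    return node


def front_relative_clause(node):
    """Rule 3: NP -> NP Relative-Clause  =>  NP -> Relative-Clause NP."""
    label, kids = node
    if label == "NP" and child_labels(node) == ["NP", "Relative-Clause"]:
        return (label, [kids[1], kids[0]])
    return node


def transform(node, rules):
    """Apply each rule at this node, then recurse into the children."""
    if isinstance(node, str):
        return node
    for rule in rules:
        node = rule(node)
    label, kids = node
    return (label, [transform(kid, rules) for kid in kids])


def words(node):
    """Read the words off the tree from left to right."""
    if isinstance(node, str):
        return [node]
    return [w for kid in node[1] for w in words(kid)]


sentence = ("Existential-There-Sentence", [
    ("There", ["there"]),
    ("Verb", ["was"]),
    ("NP", [("Det", ["an"]), ("Adjective", ["old"]), ("Noun", ["man"])]),
    ("Postnominal", ["gardening"]),
])
restructured = transform(sentence, [existential_there, front_relative_clause])
print(" ".join(words(restructured)))  # gardening an old man was
```

Because the rules are tried at the top node first and the result is then visited child by child, rule 3 fires on the noun phrase that rule 2 has just built, which mirrors the top-down order assumed in the walkthrough.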

Transformations  in  MT  systems  also  may  have  more  complex  condi- 
tions for  when  they  apply,  and  may  include  a “trigger”,  that  is,  a specific 
word  that  is  used  to  index  the  pattern,  for  efficiency.  One  way  to  formalize 
transformations  is  with  unification-based  models;  indeed  as  Chapter  11  dis- 
cussed, the  need  for  a reversible  operation  for  MT  was  the  original  motiva- 
tion for  both  feature-structure  unification  (Kay,  1984)  and  term-unification 
(Colmerauer  and  Roussel,  1996).  However,  unification  is  computationally 
expensive  and  is  not  commonly  used. 

Lexical  Transfer 

Some of the output words are determined in the course of syntactic transfer
or generation. In the example above, the function words ga and ita are mostly
grammatically controlled. Content words are another matter. The process of
finding target language equivalents for the content words of the input, lexical
transfer, is difficult for the reasons introduced in Section 21.1.

The foundation of lexical transfer is dictionary lookup in a cross-language
dictionary. As was discussed earlier, the translation equivalent may be a
single word or it may be a phrase, as in this example where gardening
becomes niwa no teire o suru (‘do garden upkeep’). Furthermore, sometimes
a generation process must subsequently inflect words in such phrases, as in
this case.

Section 21.1 also discussed the problem of words that have several
possible translations. In the example, man is such a word. The correct choice
here was ojiisan (‘old man’), but if the input had been man is the only
linguistic animal, the translation of man would have been ningen (‘human
being, man, men’); in most other cases hito (‘person, persons, man, men’) or
related words would have been appropriate. Fortunately there are at least two
ways to tackle this problem: in the parsing or in the generation stage. The
first method is to treat words like man as if they were ambiguous. That is, we
assume that man can correspond to two or more concepts (perhaps HUMAN
and ADULT MALE) and that choosing the correct Japanese word is like
disambiguating between these concepts. This way of treating lexical transfer
lets us apply all the standard techniques for lexical disambiguation
(Chapter 16). A second way is to treat such words as having only one meaning,
and to handle the selection among multiple possible translations (ningen,
hito, ojiisan and so on) by using constraints imposed by the target language
during generation (Whitelock, 1992). In practice, these cases are more often
dealt with in the parsing stage, as the algorithms for lexical choice during
generation are high-overhead (Ward, 1994), especially for content words
(but see Section 21.5).

In  this  specific  example,  however,  the  choice  of  how  to  translate  man 
is  easy.  Because  the  previous  word  is  old,  the  correct  translation  is  ojiisan 
(‘old  man’).  Such  inputs,  where  multiple  source  language  words  must  be 
expressed  with  a single  target  language  word,  can  be  difficult  to  handle,  re- 
quiring inference  in  the  general  case.  But  many  such  cases,  including  this 
one,  can  be  treated  simply  as  idioms,  with  their  own  entries  in  the  bilingual 
dictionary. 
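
The idea of giving multiword inputs their own dictionary entries can be sketched as a longest-match lookup. This is only a toy under stated assumptions: the dictionary contents, the name `lexical_transfer`, and the choice to pass unknown words through unchanged are all illustrative, and no disambiguation is attempted.

```python
# Toy bilingual dictionary keyed by tuples of source words, so that a
# multiword entry such as ("old", "man") can be listed like an idiom.
# The entries are illustrative assumptions based on the running example.
BILINGUAL = {
    ("old", "man"): ["ojiisan"],
    ("man",): ["hito"],
    ("gardening",): ["niwa", "no", "teire", "o", "suru"],
}


def lexical_transfer(words, dictionary=BILINGUAL):
    """Replace source words with target words, preferring the longest entry."""
    longest = max(len(key) for key in dictionary)
    output, i = [], 0
    while i < len(words):
        # Try the longest candidate span first, then shorter ones.
        for size in range(min(longest, len(words) - i), 0, -1):
            key = tuple(words[i:i + size])
            if key in dictionary:
                output.extend(dictionary[key])
                i += size
                break
        else:
            output.append(words[i])  # unknown word: pass it through unchanged
            i += 1
    return output


print(lexical_transfer(["old", "man", "gardening"]))
print(lexical_transfer(["young", "man"]))
```

Trying the two-word key before the one-word key is what lets old man come out as ojiisan, while man on its own falls back to the single-word entry.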

21.3  The  Interlingua  Idea:  Using  Meaning 

One  problem  with  the  transfer  model  is  that  it  requires  a distinct  set  of  trans- 
fer rules  for  each  pair  of  languages.  This  is  clearly  suboptimal  for  translation 
systems  employed  in  multilingual  environments  like  the  European  Union, 
where  eleven  official  languages  need  to  be  intertranslated. 

This suggests a different perspective on the nature of translation. The
transfer model treats translation as a process of altering the structure and
words of an input sentence to arrive at a valid sentence of the target language.
An alternative is to treat translation as a process of extracting the meaning
of the input and then expressing that meaning in the target language. If this
can be done, an MT system can do without contrastive knowledge, merely
relying on the same syntactic and semantic rules used by a standard interpreter
and generator for the language. The amount of knowledge needed is then
proportional to the number of languages the system handles, rather than to
the square, or so the argument goes.
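
The counting argument can be made concrete with a little arithmetic. The sketch below assumes one set of transfer rules per ordered language pair and, for the interlingua, one analyzer plus one generator per language; both counting conventions are assumptions, since the text only speaks of proportionality.

```python
def transfer_rule_sets(n_languages):
    # Assumption: one rule set for every ordered (source, target) pair.
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    # Assumption: one analyzer into the interlingua and one generator
    # out of it for each language.
    return 2 * n_languages


for n in (2, 3, 11):
    print(n, transfer_rule_sets(n), interlingua_modules(n))
```

Under these assumptions eleven languages need 110 transfer rule sets but only 22 interlingua modules, whereas for a single language pair the transfer approach needs fewer components.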

This  scheme  presupposes  the  existence  of  a meaning  representation, 
or  interlingua,  in  a language-independent  canonical  form,  like  the  semantic 
representations  we  saw  in  Chapter  14.  The  idea  is  for  the  interlingua  to  rep- 
resent all  sentences  that  mean  the  ‘same’  thing  in  the  same  way,  regardless 
of  the  language  they  happen  to  be  in.  Translation  in  this  model  proceeds 
by  performing  a semantic  analysis  on  the  input  from  language  X into  the 
interlingual  representation  and  generating  from  the  interlingua  to  language 
Y. 

A frequently  used  element  in  interlingual  representations  is  the  notion 
of  a small  fixed  set  of  thematic  roles,  as  discussed  in  Chapter  16.  When  used 
in  an  interlingua,  these  thematic  roles  are  taken  to  be  language  universals. 
Figure  21.6  shows  a possible  interlingual  representation  for  there  was  an  old 
man gardening as a unification-style feature structure4. We saw in Chapter 15
how a semantic analyzer can produce such a structure with an AGENT relation
between man and gardening. Note that since the interlingua requires
such  semantic  interpretation  in  addition  to  syntactic  parsing,  it  requires  more 
analysis  work  than  the  transfer  model,  which  only  required  syntactic  pars- 
ing. But  generation  can  now  proceed  directly  from  the  interlingua  with  no 
need  for  syntactic  transformations. 
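
A structure of the kind Figure 21.6 depicts can be sketched as a nested Python dictionary. Only EVENT, AGENT, GARDENING, and the values PAST, PROGRESSIVE, SINGULAR, and INDEFINITE are mentioned in the text; the remaining feature names (HEAD, AGE, NUMBER, DEFINITENESS, TENSE, ASPECT) and the helper `follow` are assumptions made for illustration.

```python
# A feature structure as nested dictionaries: features map to atomic
# values or to further feature structures. Feature names other than
# EVENT and AGENT are assumed for this sketch.
interlingua = {
    "EVENT": "GARDENING",
    "AGENT": {
        "HEAD": "MAN",
        "AGE": "OLD",
        "NUMBER": "SINGULAR",
        "DEFINITENESS": "INDEFINITE",
    },
    "TENSE": "PAST",
    "ASPECT": "PROGRESSIVE",
}


def follow(structure, *path):
    """Return the value reached by following a path of features, or None."""
    for feature in path:
        if not isinstance(structure, dict) or feature not in structure:
            return None
        structure = structure[feature]
    return structure


print(follow(interlingua, "AGENT", "HEAD"))  # MAN
print(follow(interlingua, "EVENT"))          # GARDENING
```

A generator for any target language would read from this one structure; nothing in it records English word order or the existential there.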

4 Of course this is seriously inadequate as an account of the meaning of the
existential-there construction. In fact, the currently least incomplete account
of the syntax and semantics of there constructions in English takes 124 pages
(Lakoff, 1987).

Figure 21.6 Interlingual representation of there was an old man gardening.

Note that the representation in Figure 21.6 includes the value GARDENING
as the value for the EVENT feature, and, although such cases are familiar
from Chapter 14, one might object that this looks more like an English word
than it does an element in a truly interlingual representation. There is a
deeper question here, that of the appropriate inventory of concepts and
relations for an interlingua; that is, what ontology to use. Certainly a
meaning representation designer has a lot of freedom when selecting a set
of tokens and ascribing meanings to them. However, choice of an ontology
for  MT  is  not  to  be  undertaken  lightly,  since  it  constrains  the  architecture 
of  the  system  as  a whole.  For  example,  recall  from  Chapter  16  the  discus- 
sion of  two  possible  inventories  of  thematic  roles,  one  containing  AGENT 
and  FORCE,  and  one  including  AGENT  only.  The  choice  of  which  to  adopt 
affects,  for  example,  the  way  that  the  system  will  translate  the  quake  broke 
glass  (Chapter  16)  into  Japanese,  where  quake  needs  to  be  marked  with  de, 
not  the  usual  subject  marker  ga,  because  the  earthquake  is  not  animate.  If 
we  design  our  interlingua  using  the  smaller  inventory  that  only  uses  AGENT, 
then  the  representation  for  this  sentence  will  place  the  quake  in  the  AGENT 
role,  and  the  problem  of  de  versus  ga  will  fall  to  the  generator.  If,  however, 
we  use  the  expanded  inventory  of  Figure  16.9,  then  the  representation  will 
include  the  FORCE  role,  with  the  work  needed  to  make  that  decision  being 
performed  by  the  semantic  analyzer. 
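
The trade-off can be sketched in a few lines. In the following illustration the representations are plain dictionaries, and the THEME role, the BREAK event name, and the tiny ANIMATE set are assumptions introduced only for the example; the point is where the animacy knowledge has to live under each role inventory.

```python
# Toy world knowledge about animacy; an assumption for this sketch.
ANIMATE = {"MAN", "JOHN"}


def particle_agent_only(representation):
    """AGENT-only inventory: the generator itself must check animacy."""
    filler = representation["AGENT"]
    return "ga" if filler in ANIMATE else "de"


def particle_with_force(representation):
    """AGENT plus FORCE inventory: the analyzer has already decided,
    so the generator only reads off which role is present."""
    return "de" if "FORCE" in representation else "ga"


# "the quake broke glass" under the two inventories
small_inventory = {"EVENT": "BREAK", "AGENT": "QUAKE", "THEME": "GLASS"}
large_inventory = {"EVENT": "BREAK", "FORCE": "QUAKE", "THEME": "GLASS"}

print(particle_agent_only(small_inventory))  # de
print(particle_with_force(large_inventory))  # de
```

Both designs arrive at de for the quake, but only the first requires the Japanese generator to consult knowledge about animacy.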

The  interlingua  idea  has  implications  not  only  for  syntactic  transfer 
but  also  for  lexical  transfer.  The  idea  is  to  avoid  explicit  descriptions  of 
the  relations  between  source  language  words  and  target  language  words,  in 
favor  of  mapping  via  concepts,  that  is,  language-independent  elements  of 
the ontology. Recalling our earlier problem of whether to translate man as
otoko, ningen, ojiisan, etc., it is clear that most of the processing involved is
not specific to the goal of translating into Japanese; there is a more general
problem of disambiguating man into concepts such as GENERIC-HUMAN
and MALE-HUMAN. If we commit to using such concepts in an interlingua,
then a larger part of the translation process can be done with general
language processing techniques and modules, and the processing specific to the
English-to-Japanese translation task can be eliminated or at least reduced.

Some interlinguas, and some other representations, go further and use
lexical decomposition, that is, the disassembly of words into their component
meanings. We saw a form of this in Figure 21.6, where was maps to PAST and
PROGRESSIVE, and a maps to SINGULAR and INDEFINITE. Decomposition
of content words is also possible: the word drink can be represented by
(INGEST, FLUID, BY-MOUTH)5. Representing a sentence by breaking down
the words in such ways does seem to be actually capturing something about
meaning, rather than being just a rearrangement of tokens that look like the
English words of the input. Moreover, such representations are potentially
useful for inference-based disambiguation. For example, it is possible to use
the meanings of the words to infer what the prepositional phrase is modifying
in the policeman saw the man with a telescope, versus the policeman shot
the man with a telescope. It is, however, difficult to get inference of this
sort to work for more than a few examples except in very small domains.
In general, such high-powered interlingua-based techniques are not used in
practice.
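
Lexical decomposition can be sketched by treating each word as a set of meaning components. The entry for drink follows the text; the entry for eat, with SOLID in place of FLUID, is an assumption made by analogy, as are the function names.

```python
# Each word is decomposed into a set of primitive meaning components.
DECOMPOSITION = {
    "drink": frozenset({"INGEST", "FLUID", "BY-MOUTH"}),
    "eat": frozenset({"INGEST", "SOLID", "BY-MOUTH"}),  # assumed by analogy
}


def shared(word_a, word_b):
    """Components the two words have in common."""
    return DECOMPOSITION[word_a] & DECOMPOSITION[word_b]


def distinguishing(word_a, word_b):
    """Components of word_a that word_b lacks."""
    return DECOMPOSITION[word_a] - DECOMPOSITION[word_b]


print(sorted(shared("drink", "eat")))          # ['BY-MOUTH', 'INGEST']
print(sorted(distinguishing("drink", "eat")))  # ['FLUID']
```

Set intersection and difference make explicit which elements of meaning two words share and which set them apart.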


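[Diagram: a triangle with the interlingua at its apex. On the source side,
parsing leads up from the source language words to the source language
parse tree, and interpretation leads on up to the interlingua. On the target
side, generation leads down from the interlingua, or from the target language
parse tree, to the target language words. Transfer runs across from the
source language parse tree to the target language parse tree.]

Figure 21.7 Diagram Suggesting the Relation Between the Transfer and
Interlingua Models, generally credited to Vauquois.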

Brushing over numerous important details, we can now contrast the
transfer model with the interlingua model. The key implication for processing
is that, by making the parser/interpreter and/or the generator do a little
more work, we can eliminate the need for contrastive knowledge, as
suggested in Figure 21.7.

5 This use of semantic decomposition makes it clear which elements of
meaning drink shares with eat and which it does not share. But as Chapter 16
discusses, lexical semantics is not so easy in general. For example, how does
one express in a formal language the meaning of heft and the way it differs
from weight, or the meanings of sporadic and intermittent?

Doing the extra work required by the interlingua commitment, however,
is not always easy. It requires the system designer to perform exhaustive
analysis of the semantics of the domain and formalize that in an ontology
(Levin et al., 1998). Today this is more an art than a science, although
it is relatively tractable in sublanguage domains. In some cases the semantics
can mostly be captured by a database model, as in the air travel, hotel
reservation, or restaurant recommendation domains. In cases like these, the
database definition determines the possible entities and relations, and the
MT system designer’s task is largely one of determining how these map to
the words and structures of the two languages.

Another problem with the interlingua idea is that, in its pure form, it
requires the system to fully disambiguate at all times. For a true universal
interlingua, this may require some unnecessary work. For example, in order
to translate from Japanese to Chinese the interlingua must include concepts
such as ELDER-BROTHER and YOUNGER-BROTHER. However, to use those
same concepts in the course of translating from German to English would
require a parser to perform more disambiguation effort than is necessary,
and would further require the system to include techniques for preserving
ambiguity, to ensure that the output is ambiguous or vague in exactly the same
way as the input. Even discounting the Sapir-Whorf idea, the idea of a
universal meaning underlying all languages is clearly not without problems.


21.4  Direct  Translation 

These models are all very nice, but what happens if the analysis fails? Users
do not like to receive an output of “nil” due to “no parse tree found”; in
general, they would rather get something imperfect than nothing at all. This is
a challenge especially for interlingua-based models, where the system should
not fail to translate it broke the glass because it cannot figure out whether it
is a FORCE or AGENT.

Several approaches are available. One is to use the robust parsing
techniques discussed in Chapter 15, which sometimes amounts to translating by
fragments. Another is to give up on producing elaborate structural analyses
at all, and just do simple operations that can be done reliably. More radically,
we could adopt the principle that an MT system should do as little work as
possible. Systems built according to this philosophy are sometimes called
direct MT systems. Typically such systems are built with only one language
pair in mind, and the only processing done is that needed to get from one
specific source language to one specific target language.

A direct MT system is typically composed of several stages, each
focused on one type of problem. For example, we can rewrite a Japanese
sentence as an English one in six stages, as seen in Figure 21.8. Figure 21.9
illustrates how this might work for a simple example.

Stage   Action
1.      morphological analysis
2.      lexical transfer of content words
3.      various work relating to prepositions
4.      SVO rearrangements
5.      miscellany
6.      morphological generation

Figure 21.8 Six Stages for a Direct MT System for Japanese to English

Stage  1 in  Figure  21.9  segments  the  input  string  into  words  (recall  that 
Japanese,  like  Chinese,  does  not  use  spaces  as  word  boundary  markers),  and 
does  morphological  analysis  of  complex  verb  forms.  These  can  be  done 
using  the  finite-state  techniques  of  Chapter  3 and  segmentation  algorithms 
like  the  probabilistic  one  described  in  Chapter  5. 

Stage  2 chooses  translation  equivalents  for  the  content  words.  This  is 
done  using  a bilingual  dictionary,  or  procedures  that  choose  the  correct  trans- 
lation based  on  the  local  context  and  on  the  target  language  words  already 
chosen.  Figure  21.10  illustrates  such  a procedure. 

Input:           watashihatsukuenouenopenwojonniageta.
After stage 1:   watashi ha tsukue no ue no pen wo jon ni ageru PAST.
After stage 2:   I ha desk no ue no pen wo John ni give PAST.
After stage 3:   I ha pen on desk wo John to give PAST.
After stage 4:   I give PAST pen on desk John to.
After stage 5:   I give PAST the pen on the desk to John.
After stage 6:   I gave the pen on the desk to John.

Figure 21.9 An Example of Processing in a Direct System

In this example lexical transfer is trivial. In general, though, there may
be interdependencies among target-language words, and so lexical transfer
may be done in sub-stages, for example, verbs before nouns before
adjectives. For example, consider the problem of translating nomu from
Japanese to English, where this must become either drink or take (medicine).
This decision must be made before translations for modifiers are chosen, to
allow translations such as drinking heavily and taking a lot of medicine, but
not a scramble of the two. In general the problem of the best order in which
to make decisions is a tricky one, although there are some standard solutions,
as seen in Chapter 20.

Stage  3 chooses  to  translate  no  ue  no  (‘at  top  of’)  to  on,  and  reverses 
the  two  associated  noun  phrases  (desk  and  pen),  since  English  prepositional 
phrases  follow,  not  precede,  the  word  they  modify.  In  accordance  with  the 
dictionary  entry  for  gave,  which  specifies  subcategorization  facts,  it  chooses 
to  translate  ni  as  to. 

Stage  4 invokes  a procedure  to  move  the  verb  from  the  end  of  the  sen- 
tence to  the  position  after  the  subject,  and  removes  case  marking  from  sub- 
jects and  direct  objects. 

Stage  5 handles  things  like  moving  case  markers  before  nouns  and  in- 
serting articles. 

Finally  Stage  6 inflects  the  verbs. 
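
The six stages can be strung together in a short sketch that reproduces the rows of Figure 21.9 for this one sentence. Everything here is a toy under loud assumptions: the word lists, the greedy longest-match segmenter, and the hard-wired patterns cover only this example and stand in for the real dictionaries and procedures a direct system would use.

```python
# Toy lexicons; every entry is an assumption that covers only this sentence.
LEXICON = ["watashi", "tsukue", "ageta", "pen", "jon", "ha", "no", "ue", "wo", "ni"]
MORPHOLOGY = {"ageta": ["ageru", "PAST"]}
CONTENT_WORDS = {"watashi": "I", "tsukue": "desk", "pen": "pen",
                 "jon": "John", "ageru": "give"}
VERBS = {"give"}
COMMON_NOUNS = {"pen", "desk"}
PAST_FORMS = {"give": "gave"}


def stage1(text):
    """Segment by greedy longest match, then split inflected verbs."""
    tokens, i = [], 0
    while i < len(text):
        match = max((w for w in LEXICON if text.startswith(w, i)), key=len)
        tokens.extend(MORPHOLOGY.get(match, [match]))
        i += len(match)
    return tokens


def stage2(tokens):
    """Lexical transfer of content words; function words pass through."""
    return [CONTENT_WORDS.get(t, t) for t in tokens]


def stage3(tokens):
    """Prepositions: 'X no ue no Y' becomes 'Y on X', and 'ni' becomes 'to'."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i + 1:i + 4] == ["no", "ue", "no"] and i + 4 < len(tokens):
            out += [tokens[i + 4], "on", tokens[i]]
            i += 5
        else:
            out.append("to" if tokens[i] == "ni" else tokens[i])
            i += 1
    return out


def stage4(tokens):
    """Move the verb group after the subject; drop the markers ha and wo."""
    subject_end = tokens.index("ha")
    verb_start = next(i for i, t in enumerate(tokens) if t in VERBS)
    middle = [t for t in tokens[subject_end + 1:verb_start] if t != "wo"]
    return tokens[:subject_end] + tokens[verb_start:] + middle


def stage5(tokens):
    """Miscellany: put 'to' before its noun and insert definite articles."""
    out = []
    for t in tokens:
        if t == "to" and out:
            out.insert(len(out) - 1, "to")
        elif t in COMMON_NOUNS:
            out += ["the", t]
        else:
            out.append(t)
    return out


def stage6(tokens):
    """Morphological generation: realize 'verb PAST' as an inflected form."""
    out = []
    for t in tokens:
        if t == "PAST" and out:
            out[-1] = PAST_FORMS.get(out[-1], out[-1] + "ed")
        else:
            out.append(t)
    return out


def translate(text):
    tokens = text
    for stage in (stage1, stage2, stage3, stage4, stage5, stage6):
        tokens = stage(tokens)
    return " ".join(tokens) + "."


print(translate("watashihatsukuenouenopenwojonniageta"))
```

Each stage takes and returns a flat list of tokens, so no parse tree is ever built; this is the sense in which the processing is direct.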


function DIRECTLY_TRANSLATE_MUCH/MANY(word) returns Russian translation

  if preceding word is how
    return skol’ko
  else if preceding word is as
    return stol’ko zhe
  else if word is much
    if preceding word is very
      return nil (not translated)
    else if following word is a noun
      return mnogo
  else /* word is many */
    if preceding word is a preposition and following word is a noun
      return mnogii
    else return mnogo


Figure  21.10  A procedure  for  translating  much  and  many  into  Russian, 
adapted  from  Hutchins’  (1986,  pg.  133)  discussion  of  Panov  1960. 
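
For readers who want to run the procedure of Figure 21.10, here is one possible Python rendering. The input format (a list of word and part-of-speech pairs plus an index), the tag names NOUN and PREP, and the function name are assumptions made for this sketch. The figure gives no result for much when it is neither preceded by very nor followed by a noun, so the sketch returns None there as well.

```python
def translate_much_many(tagged, i):
    """Choose a Russian equivalent for 'much' or 'many' at position i.

    tagged is a list of (word, part_of_speech) pairs; the tag names
    "NOUN" and "PREP" are assumed conventions, not from the source.
    """
    word = tagged[i][0]
    prev_word, prev_tag = tagged[i - 1] if i > 0 else (None, None)
    next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else None

    if prev_word == "how":
        return "skol'ko"
    if prev_word == "as":
        return "stol'ko zhe"
    if word == "much":
        if prev_word == "very":
            return None  # not translated
        if next_tag == "NOUN":
            return "mnogo"
        return None  # case left open in the figure
    # otherwise the word is "many"
    if prev_tag == "PREP" and next_tag == "NOUN":
        return "mnogii"
    return "mnogo"


print(translate_much_many([("how", "ADV"), ("much", "ADJ")], 1))
print(translate_much_many([("in", "PREP"), ("many", "ADJ"), ("cases", "NOUN")], 1))
```

The procedure looks only at the immediately neighboring words and their classes, which is typical of the local, word-specific decisions made in direct systems.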


There are several ways in which this approach differs from the
approaches seen earlier. One is that it is a new way of modularizing the MT
task, orthogonal to the types of modularity seen in the transfer and
interlingua models in Figure 21.7. In the direct model, all the processing
involving analysis of one specific problem (prepositions for example) is handled
in one stage, including analysis, transfer, and generation aspects. The
advantage of this is that solving specific problems one at a time may be more
tractable. On the other hand, it can be advantageous to organize processing
into larger modules (analysis, transfer, synthesis) if there is synergy among
all the various individual analysis problems, or among all the individual
generation problems, etc.

A second  characteristic  of  direct  systems  is  that  lexical  transfer  may 
be  more  procedural.  Lexical  transfer  procedures  may  eclectically  look  at  the 
syntactic  classes  and  semantic  properties  of  neighboring  words  and  depen- 
dents and  heads,  as  seen  in  the  decision-tree-like  procedure  for  translating 
much  and  many  into  Russian  in  Figure  21.10. 

A third characteristic of direct models is that they tend to be
conservative, reordering words only when required by obvious ungrammaticality
in the result of direct word-for-word substitution. In particular, direct systems
generally do lexical transfer before syntactic processing.

Perhaps  the  key  characteristic  of  direct  models  is  that  they  do  with- 
out complex  structures  and  representations.  In  general,  they  treat  the  input 
as  a string  of  words  (or  morphemes),  and  perform  various  operations  di- 
rectly on  it  — replacing  source  language  words  with  target  language  words, 
re-ordering  words,  etc.  — to  end  up  with  a string  of  symbols  in  the  target 
language. 

In  practice,  of  course,  working  MT  systems  tend  to  be  combinations  of 
the  direct,  transfer,  and  interlingua  methods.  But  of  course  syntactic  process- 
ing is  not  an  all-or-nothing  thing.  Even  if  the  system  does  not  do  a full  parse, 
it can adorn its input with various useful syntactic information, such as part
of speech tags, segmentation into clauses or phrases, dependency links, and
bracketings. Many systems that are often characterized as direct translation
systems  also  adopt  various  techniques  generally  associated  with  the  transfer 
and  interlingua  approaches  (Hutchins  and  Somers,  1992). 


21.5  Using  Statistical  Techniques 

The  three  architectures  for  MT  introduced  in  previous  sections,  the  transfer, 
interlingua, and direct models, all provide answers to the questions of what
representations to use and what steps to perform to translate. But there is
another  way  to  approach  the  problem  of  translation:  to  focus  on  the  result, 
not  the  process.  Taking  this  perspective,  let’s  consider  what  it  means  for  a 
sentence  to  be  a translation  of  some  other  sentence. 

This  is  an  issue  to  which  philosophers  of  translation  have  given  a lot  of 
thought.  The  consensus  seems  to  be,  sadly,  that  it  is  impossible  for  a sentence 
in one language to be a translation of a sentence in another, strictly speaking.
For  example,  one  cannot  really  translate  Hebrew  adonai  roi  (‘the  Lord  is  my 
shepherd’)  into  the  language  of  a culture  that  has  no  sheep.  On  the  one  hand, 
we  can  write  something  that  is  clear  in  the  target  language,  at  some  cost  in 
fidelity  to  the  original,  something  like  the  Lord  will  look  after  me.  On  the 
other  hand,  we  can  be  faithful  to  the  original,  at  the  cost  of  producing  some- 
thing obscure  to  the  target  language  readers,  perhaps  like  the  Lord  is  for  me 
like somebody who looks after animals with cotton-like hair. As another example,
if we translate the Japanese phrase fukaku hansei shite orimasu as
we apologize, we are not being faithful to the meaning of the original, but if
we produce we are deeply reflecting (on our past behavior, and what we did
wrong, and how to avoid the problem next time), then our output is unclear
or  awkward.  Problems  such  as  these  arise  not  only  for  culture-specific  con- 
cepts, but  whenever  one  language  uses  a metaphor,  a construction,  a word, 
or  a tense  without  an  exact  parallel  in  the  other  language. 

So,  true  translation,  which  is  both  faithful  to  the  source  language  and 
natural  as  an  utterance  in  the  target  language,  is  sometimes  impossible.  If 
you are going to go ahead and produce a translation anyway, you have to
compromise.  This  is  exactly  what  translators  do  in  practice:  they  produce 
translations  that  do  tolerably  well  on  both  criteria. 

This  provides  us  with  a hint  for  how  to  do  MT.  We  can  model  the 
goal  of  translation  as  the  production  of  an  output  that  maximizes  some  value 
function  that  represents  the  importance  of  both  faithfulness  and  fluency.  If 
we choose the product of fluency and faithfulness as our quality metric, we
can  formalize  the  translation  problem  as: 

$$\text{best-translation } \hat{T} = \operatorname*{argmax}_{T} \; \text{fluency}(T) \cdot \text{faithfulness}(T, S)$$

where  T is  the  target-language-sentence  and  S the  source-language-sentence. 

This  model  of  translation  was  first  described  by  researchers  coming 
from speech recognition (Brown et al., 1990a, 1993), and this model clearly
resembles  the  Bayesian  models  we’ve  used  for  speech  recognition  in  Chap- 
ter 7 and  for  spell  checking  in  Section  5.4.  We  can  make  the  analogy  perfect 
and apply the noisy channel model of Section 5.4 if we think of things backwards:
thus we pretend that the input we must translate is a corrupted version
of  some  target  language  sentence,  and  that  our  task  is  to  discover  that  target 
language  sentence: 

$$\text{best-translation } \hat{T} = \operatorname*{argmax}_{T} \; P(T) \, P(S \mid T)$$

To implement this, we need to do three things: quantify fluency, P(T),
quantify faithfulness, P(S|T), and create an algorithm to find the sentence
that maximizes the product of these two things.
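
The step from the fluency and faithfulness product to the noisy channel form can be spelled out with Bayes' rule. The derivation below is a sketch that assumes the quantity we ultimately want to maximize is P(T|S), the probability of a target sentence given the source sentence; that reading is an assumption, not something stated explicitly above.

```latex
\begin{align*}
\hat{T} &= \operatorname*{argmax}_{T} P(T \mid S) \\
        &= \operatorname*{argmax}_{T} \frac{P(T)\, P(S \mid T)}{P(S)} \\
        &= \operatorname*{argmax}_{T} P(T)\, P(S \mid T)
\end{align*}
```

The denominator P(S) can be dropped in the last line because the source sentence is fixed while we search over candidate values of T.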

There  is  an  innovation  here.  In  the  transfer,  interlingua,  and  direct 
models,  each  step  of  the  process  made  some  adjustment  to  the  input  sentence 
to  make  it  closer  to  a fluent  TL  sentence,  while  obeying  the  constraint  of  not 
changing  the  meaning  too  much.  In  those  models  the  process  is  fixed,  in  that 
there is no flexibility to trade off a modicum of faithfulness for a smidgeon of
naturalness, or conversely, based on the specific input sentence at hand. This
new model, sometimes called the statistical model of translation, allows
exactly that.


Quantifying  Fluency 

Fortunately,  we  already  have  some  useful  metrics  for  how  likely  a sentence 
is  to  be  a real  English  sentence:  the  language  models  from  Chapters  6 and 
8. These allow us to distinguish things that are readable but not really English
(such as that car was almost crash onto me) from things that are more
fluent  (that  car  almost  hit  me).  This  is  especially  valuable  for  word  order 
and  collocations,  and  as  such  can  be  a useful  supplement  to  the  generation 
techniques  of  Chapter  20. 
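
To make the idea of a fluency score concrete, here is a minimal sketch of an add-one smoothed bigram model applied to the two example sentences. The four-sentence training corpus, the function names, and the choice of add-one smoothing are all assumptions made for illustration; a usable model would be trained on far more text with better smoothing.

```python
import math
from collections import Counter

# Toy training corpus, invented for illustration only.
CORPUS = [
    "that car almost hit me",
    "the car almost hit the wall",
    "that dog almost bit me",
    "a car was parked near me",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_fluency(sentence, unigrams, bigrams):
    """Return log P(T) under an add-one smoothed bigram model."""
    vocab_size = len(unigrams) + 1  # the extra 1 reserves mass for unseen words
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


UNIGRAMS, BIGRAMS = train_bigram(CORPUS)
fluent = log_fluency("that car almost hit me", UNIGRAMS, BIGRAMS)
awkward = log_fluency("that car was almost crash onto me", UNIGRAMS, BIGRAMS)
print(fluent > awkward)
```

Note that a raw product of bigram probabilities also penalizes longer sentences, so part of the gap between the two scores here comes from length rather than from word order.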

Fluency  models  can  be  arbitrarily  sophisticated;  any  technique  that  can 
assign  a better  probability  to  a target  language  string  is  appropriate,  including 
the  more  sophisticated  probabilistic  grammars  of  Chapter  12  or  the  statistical 
semantic  techniques  of  Chapter  17. 

Of  course,  the  idea  of  using  monolingual  language  knowledge  to  im- 
prove MT  output  is  independent  of  the  decision  to  model  that  knowledge 
statistically.  Indeed,  many  MT  systems,  especially  direct  ones,  have  a final 
phase,  in  which  the  system  uses  local  considerations  to  revise  word  choices 
in  the  output.  For  example,  capitalizing  every  occurrence  of  white  house 
that  occurs  as  the  subject  of  a verb  (the  white  house  announced  today)  is  a 
reasonable heuristic.

Quantifying Faithfulness

Given the French sentence ça me plaît (that me pleases) and some conceivable
English equivalents that pleases me, I like it, and I’ll take that one, and
yes, good, it is intuitively clear that the first is more faithful.

Although  it  is  hard  to  quantify  this  intuition,  one  basic  factor  often  used 
in metrics for fidelity is the degree to which the words in one sentence are
plausible  translations  of  the  words  of  the  other.  Thus  we  can  approximate 
the  probability  of  a sentence  being  a good  translation  as  the  product  of  the 
probabilities  that  each  target  language  word  is  an  appropriate  translation  of 
some  source  language  word.  For  this  we  need  to  know,  for  every  source 
language  word,  the  probability  of  it  mapping  to  each  possible  target  language 
word. 
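
The word-by-word approximation described above can be sketched as follows. The probability table, the floor value for words with no plausible source word, and the choice to take the best-matching source word for each target word are all assumptions made for this illustration; the numbers are invented, not estimated from a corpus.

```python
# Hypothetical word-translation probabilities, keyed by source (French) word.
# The values are invented for illustration.
T_TABLE = {
    "ça": {"that": 0.6, "it": 0.3, "this": 0.1},
    "me": {"me": 0.7, "I": 0.2, "myself": 0.1},
    "plaît": {"pleases": 0.6, "like": 0.3, "good": 0.1},
}
FLOOR = 1e-4  # assumed probability for a target word with no plausible source


def faithfulness(target_words, source_words, table=T_TABLE, floor=FLOOR):
    """Multiply, over target words, the best probability that the word
    translates some source word (a crude stand-in for a translation model)."""
    score = 1.0
    for target in target_words:
        best = max(table.get(source, {}).get(target, 0.0)
                   for source in source_words)
        score *= max(best, floor)
    return score


SOURCE = ["ça", "me", "plaît"]
CANDIDATES = ["that pleases me", "I like it", "yes good"]
SCORES = {c: faithfulness(c.split(), SOURCE) for c in CANDIDATES}
RANKED = sorted(CANDIDATES, key=SCORES.get, reverse=True)
print(RANKED)
```

Because the score ignores word order and alignment, it only captures the lexical part of faithfulness; it says nothing about whether the words are arranged sensibly.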

Where  do  we  get  these  probabilities?  Standard  bilingual  dictionaries 
do  not  include  such  information,  but  they  can  be  computed  from  bilingual 
corpora,  that  is,  parallel  texts  in  two  languages.  This  is  not  trivial,  since 
bilingual  corpora  do  not  come  with  annotations  specifying  which  word  maps 
to  which.  Solving  this  problem  requires  first  solving  the  problem  of  sen- 
tence alignment  in  a bilingual  corpus,  determining  which  source  language 
sentence  maps  to  which  target  language  sentence,  which  can  be  done  with 
reasonable accuracy (Kay and Röscheisen, 1993; Gale and Church, 1993;
Melamed, 1999; Manning and Schütze, 1999). The second problem, word
alignment, that is, determining which word(s) of the target correspond to
each source language word or phrase, is rather more difficult (Melamed,
to appear), and is often addressed with EM methods (cf. Chapter 7). From bilingual
corpora aligned in these ways it is possible to count how many times a
word,  phrase,  or  structure  gets  mapped  to  each  of  its  possible  translations. 
Such alignments are potentially useful not only for MT but also for automatic
generation of bilingual dictionary entries for use by human translators
(Dagan  and  Church,  1997;  Fung  and  McKeown,  1997). 
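
Once word alignments are available, the counting step can be sketched as a relative-frequency estimate. The aligned word pairs below are hypothetical and are assumed to come from an earlier alignment step; the function name is likewise illustrative. This is not the EM procedure itself, only the final counting it makes possible.

```python
from collections import Counter, defaultdict

# Hypothetical (source_word, target_word) links, assumed to have been
# produced already by a word-alignment step.
ALIGNED_LINKS = [
    ("maison", "house"), ("maison", "house"), ("maison", "home"),
    ("bleue", "blue"), ("maison", "house"), ("bleue", "blue"),
]


def estimate_translation_probs(links):
    """Relative-frequency estimate: count(source, target) / count(source)."""
    pair_counts = Counter(links)
    source_counts = Counter(source for source, _ in links)
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / source_counts[source]
    return dict(table)


TABLE = estimate_translation_probs(ALIGNED_LINKS)
print(TABLE["maison"])
```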

Let’s now consider an example. Suppose we want to translate the two-word
Japanese phrase 2000nen taio into English. The most probable translation
for the first word is, we will assume, 2000, followed by year 2000, Y2K,
2000  years,  2000  year  and  some  other  possibilities.  The  most  probable  trans- 
lation for  the  second  word  is,  we  will  assume,  correspondence,  followed  by 
corresponding,  equivalent,  tackle,  deal  with,  dealing  with,  countermeasures, 
respond,  response,  counterpart,  antithesis  and  so  on.  Thus,  according  to  the 
translation model alone, the most highly ranked candidate will be the composition
of the most highly ranked words, namely 2000 correspondence.
But, when the contribution of the fluency model, perhaps a bigram model, is
factored  in,  the  candidate  translation  dealing  with  Y2K  will  have  the  highest 
overall  score. 
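
The interaction between the two models in this example can be sketched numerically. Every probability below is invented for illustration, only a few of the listed word translations are included, and the candidate set simply tries both word orders; none of this is taken from a real system.

```python
from itertools import product

UNSEEN = 0.001  # assumed probability for any bigram not listed below

# Hypothetical translation probabilities for each Japanese word.
FIRST = {"2000": 0.40, "year 2000": 0.25, "Y2K": 0.20, "2000 years": 0.10}
SECOND = {"correspondence": 0.30, "tackle": 0.12,
          "dealing with": 0.08, "countermeasures": 0.06}

# Hypothetical English bigram probabilities (the fluency model).
BIGRAM = {
    ("dealing", "with"): 0.5, ("with", "Y2K"): 0.05, ("with", "2000"): 0.01,
    ("year", "2000"): 0.2, ("2000", "years"): 0.1,
    ("Y2K", "countermeasures"): 0.02, ("tackle", "Y2K"): 0.01,
}


def fluency(phrase):
    """Product of bigram probabilities over adjacent words in the phrase."""
    words = phrase.split()
    score = 1.0
    for pair in zip(words, words[1:]):
        score *= BIGRAM.get(pair, UNSEEN)
    return score


def candidates():
    """Yield (phrase, translation probability) for both word orders."""
    for (first, p1), (second, p2) in product(FIRST.items(), SECOND.items()):
        yield f"{first} {second}", p1 * p2
        yield f"{second} {first}", p1 * p2


ALL = list(candidates())
best_by_translation = max(ALL, key=lambda c: c[1])[0]
best_overall = max(ALL, key=lambda c: c[1] * fluency(c[0]))[0]
print(best_by_translation, "|", best_overall)
```

With these made-up numbers the translation model alone prefers the composition of the two top-ranked words, while the product of the translation and fluency scores prefers dealing with Y2K.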

Of course, more complex translation models are possible: anything
that generates multiple translations with a ranking associated with each. It
is even possible to do “multi-engine” translation, where several translation
models (for example a powerful but brittle interlingua-based one and a robust
but low-quality direct one) are run in parallel to generate various translations
and  translation  fragments,  with  the  final  output  determined  by  assembling 
the  pieces  which  have  highest  confidence  scores  (Brown  and  Frederking, 
1995). 

Search 

So  far  we  have  a theory  of  which  sentence  is  best,  but  not  of  how  to  find  it. 
Since  the  number  of  possible  translations  is  enormous,  we  must  find  the  best 
output  without  actually  generating  the  infinite  set  of  all  possible  translations. 
But  this  is  just  a decoding  problem,  of  the  kind  we  have  seen  how  to  solve  via 
the  pruned  Viterbi  (beam-search)  and  A*  algorithms  of  Chapter  7.  For  MT 
this decoding is done in the usual way: outputs (translations) are generated
incrementally, and evaluated at each point. If at any point the probability
drops below some criterion, that line of attack is pruned. Generation can be
left  to  right  or  outward  from  heads. 
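
A minimal sketch of incremental, pruned decoding is given below. It is a simplified left-to-right beam search that translates one source word at a time without any reordering, which is a strong assumption; the model tables, the beam width, and the French example are all invented for illustration.

```python
import math

# Hypothetical model tables, invented for illustration.
TRANSLATIONS = {
    "le": {"the": 0.7, "it": 0.3},
    "chat": {"cat": 0.8, "chat": 0.2},
    "dort": {"sleeps": 0.6, "sleep": 0.4},
}
BIGRAM = {
    ("<s>", "the"): 0.4, ("<s>", "it"): 0.3,
    ("the", "cat"): 0.3, ("the", "chat"): 0.05,
    ("cat", "sleeps"): 0.2, ("cat", "sleep"): 0.02,
}
UNSEEN = 1e-3  # assumed probability for unlisted bigrams


def beam_search(source_words, beam_width=2):
    """Translate word by word, keeping only the beam_width best partial
    outputs after each step; everything else is pruned."""
    beam = [(0.0, ["<s>"])]  # (log probability, output so far)
    for source in source_words:
        extended = []
        for log_prob, output in beam:
            for target, p_trans in TRANSLATIONS[source].items():
                p_lm = BIGRAM.get((output[-1], target), UNSEEN)
                score = log_prob + math.log(p_trans) + math.log(p_lm)
                extended.append((score, output + [target]))
        # Prune: hypotheses outside the beam are never extended again.
        extended.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = extended[:beam_width]
    best_log_prob, best_output = beam[0]
    return " ".join(best_output[1:]), best_log_prob


translation, log_prob = beam_search(["le", "chat", "dort"])
print(translation)
```

A fixed beam width is used here in place of a probability threshold; either criterion discards low-scoring partial translations before they are extended, at the risk of pruning a hypothesis that would have won in the end.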

Good introductions to statistical MT include (Brown et al., 1990b) and
(Knight,  1997).  One  of  the  most  influential  recent  systems  is  described  in 
(Knight  et  al.,  1994). 


21.6  Usability  and  System  Development 

Since  MT  systems  are  generally  run  by  human  operators,  the  human  is  avail- 
able to  help  the  machine.  One  way  to  use  human  intervention  is  interac- 
tively; that  is,  when  the  system  runs  into  a problem,  it  can  ask  the  user. 
For  example,  a system  given  the  input  the  chicken  are  ready  to  eat  could 
generate  paraphrases  of  both  possible  meanings,  and  present  the  user  with 
those  alternatives,  for  example,  asking  her  to  decide  whether  the  sentence 
means  the  chicken  are  ready  to  be  eaten  or  the  chicken  are  ready  to  eat 
something.  It  turns  out  that  this  is  incredibly  annoying  — users  do  not  like 
to have to answer questions from a computer, or to feel that they exist to help
the computer get its work done (Cooper, 1995). On the other hand, people
are comfortable with the job of fixing up poorly-written sentences, and so
post-editing  is  the  normal  mode  of  human  interaction  with  MT  systems. 

People are also able to edit sentences of the source language, and this
ability can be exploited as a way to improve the translatability of the input by
simplifying  it  in  various  ways.  Such  pre-editing  can  be  more  cost-effective 
than  post-editing  if  a single  document  needs  to  be  translated  into  several  lan- 
guages, since  the  cost  of  pre-editing  can  then  be  amortized  over  many  output 
languages  — as  is  often  the  case  for  companies  which  sell  things  complete 
with  documentation,  in  many  countries  (Mitamura  and  Nyberg,  1995).  In 
order  to  decide  what  needs  pre-editing,  one  way  is  to  apply  MT  and  see 
what  comes  out  wrong,  and  then  go  back  and  rewrite  those  sentences  in  the 
original.  Another  way  is  to  have  a model  of  what  MT  ought  to  handle,  and 
require  input  sentences  to  be  rewritten  in  that  sublanguage,  for  example,  by 
disallowing  PPs  which  could  attach  ambiguously.  If  such  a model  exists,  the 
pre-editing  phase  can  actually  be  dispensed  with,  by  training  the  technical 
writers  to  only  write  in  simple,  unambiguous  controlled  language,  a version 
of  English  that  passes  the  constraints  of  the  sublanguage  grammar  checker. 
Doing  so  may  also  make  the  source  language  text  more  understandable.  This 
is  interesting  as  a case  where  focusing  on  the  larger  task  (getting  information 
from  tech  writers  to  customers),  rather  than  the  problem  as  originally  posed 
(to  translate  some  existing  documents),  leads  to  improvements  of  the  entire 
process. 

In  general,  user  satisfaction  is  vital  for  MT  systems.  Various  evaluation 
metrics are used to predict acceptability. Evaluation metrics for MT intended
to  be  used  raw  (for  information  acquisition)  include  the  percentage  of  sen- 
tences translated  correctly,  or  nearly  correctly,  where  correctness  depends  on 
both  fidelity  and  fluency.  The  typical  evaluation  metric  for  MT  output  to  be 
post-edited  is  edit  cost,  either  relative  to  some  standard  translation  via  some 
automatic  measure  of  edit-distance,  similar  to  those  seen  in  Chapter  7 for 
evaluating  speech  recognition,  or  measured  directly  as  the  amount  of  time 
(or  number  of  keystrokes)  required  to  correct  the  output  to  an  acceptable 
level. 
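
An automatic edit-cost measure of the kind mentioned above can be sketched as a word-level edit distance between the MT output and a reference translation. Unit costs for insertion, deletion, and substitution, and the normalization by reference length, are assumptions made for this sketch.

```python
def edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions that turn hypothesis into reference."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1,      # delete a hypothesis word
                               current[j - 1] + 1,   # insert a reference word
                               substitution))
        previous = current
    return previous[-1]


def edit_cost(hypothesis, reference):
    """Edit distance divided by reference length (an assumed normalization)."""
    return edit_distance(hypothesis, reference) / max(len(reference.split()), 1)


distance = edit_distance("that car was almost crash onto me",
                         "that car almost hit me")
cost = edit_cost("that car was almost crash onto me", "that car almost hit me")
print(distance, cost)
```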

In general the content words are crucial; users can generally recover
from  scrambled  syntax,  but  having  the  words  translated  properly  is  vital.  In 
practice,  one  of  the  major  advantages  of  using  a MT  system  is  that  it  handles 
most of the tedious work of looking up words in bilingual dictionaries.6 As a
result, professional MT users put great value on dictionary size and quality.
Such  users  typically  augment  the  basic  system  dictionary  with  the  purchase 
of  a domain-specific  dictionary  designed  for  the  type  of  translation  work 
they  do:  medical,  electronic,  financial,  military  intelligence  etc.  But  no  off- 
the-shelf  dictionary,  even  one  developed  from  a corpus  of  texts  in  the  proper 
domain  area,  is  more  than  an  approximation  to  the  dictionary  needed  by 
a specific  customer,  and  so  established  translation  bureaus  typically  invest 
substantial  effort  in  augmenting  the  system  dictionaries  with  entries  of  their 
own.  The  structure  of  these  dictionaries  is  simple  because  the  specialist 
terminology  of  any  field  is  generally  unambiguous  — a photon  is  a photon  is 
a photon,  no  matter  what  context  it  comes  up  in  — and  because  terminology 
is  almost  invariably  open-class  words,  with  no  syntactic  idiosyncrasies. 

It  has  also  become  apparent  that  MT  systems  do  better  if  the  dictionar- 
ies include  not  only  words  but  also  idioms,  fixed  phrases,  and  even  frequent 
clauses  and  sentences.  Such  data  can  sometimes  be  extracted  automatically 
from  corpora.  Moreover,  in  some  situations  it  may  be  valuable  to  do  this 
on-line,  at  translation  time,  rather  than  saving  the  results  in  a dictionary  — 
this is the key idea behind Example-based Machine Translation (Sumita
and  Iida,  1991;  Brown,  1996). 

User  satisfaction  also  turns  out  to  depend  on  factors  other  than  the  ac- 
tual quality  of  the  translation.  Many  users  care  less  about  output  quality  than 
other  factors,  such  as  cost,  speed,  storage  requirements,  the  ability  to  run 
transparently  inside  their  favorite  editor,  the  ability  to  preserve  SGML  tags, 
and  so  on.  Translation  memory,  the  ability  to  store  and  recall  previously 
corrected  translations,  is  also  a big  selling  point. 

Although  for  expository  purposes  the  previous  sections  have  focussed 
on  a few  basic  problems  that  arise  in  translation,  it  is  important  to  realize 
that  these  far  from  exhaust  the  things  that  MT  systems  have  to  worry  about. 
As  Section  21.1  may  have  suggested,  language  differences  are  a virtually 
inexhaustible  source  of  complexity;  and  if  you  were  reading  the  footnotes  in 
the  previous  sections,  you  may  have  been  annoyed  that  every  “fact”  we  men- 
tioned about  a language  was  actually  an  oversimplification.  Indeed,  much  of 
the  work  developing  a MT  system  is  down  in  the  weeds,  dealing  with  details 
like  this,  regardless  of  the  overall  system  architecture  chosen.  Furthermore, 
adding  more  knowledge  does  not  always  help,  since  a working  MT  system, 
like any huge software system, is a large, delicate piece of code. Improvement
to the treatment of one phenomenon, or a correction of a bug in the
translation of one sentence, can cause other sentences, previously translated
correctly,  to  go  awry. 

Given all this, it is surprising that MT systems do as well as they do.
One  development  technique  of  proven  value  is  iterative  development:  build 
it,  evaluate  it  in  actual  use,  improve  it,  and  repeat.  In  the  course  of  this 
process  the  MT  system  is  adapted  to  a domain,  to  the  working  habits  of  its 
users,  and  to  the  needs  of  the  consumers  of  the  output. 


21.7  Summary 

• Although MT systems exploit many standard language-processing techniques,
there are also some MT-specific ones, including notably syntactic
transformations.

• We  have  presented  four  models  for  MT,  the  transfer,  interlingua,  di- 
rect, and  statistical  approaches.  Practical  MT  systems  today,  how- 
ever, typically  combine  ideas  from  several  of  these  models;  while  MT 
research systems are probing other niches in the design space.

• MT  system  design  is  hard  work,  requiring  careful  selection  of  models 
and  algorithms  and  combination  into  a useful  system.  Today  this  is 
more  a craft  than  a science,  especially  since  this  must  be  done  while 
minimizing  development  cost. 

• While MT system design today is thus fairly ad hoc, there are ongoing
efforts to develop useful formal models of translation (Alshawi et al.,
1998; Knight and Al-Onaizan, 1998; Wu and Wong, 1998).

• While the possibilities for improvement for MT are truly impressive,
the  output  of  today’s  systems  is  acceptable  for  rough  translations 
for  information-acquisition  purposes,  draft  translations  intended  to 
be  post-edited  by  a human  translator,  and  translation  for  sublanguage 
domains. 

• As for many software tasks, user interface issues in MT are crucial; the
value  of  MT  systems  to  users  is  not  directly  related  to  the  sophistication 
of  their  algorithms  or  representations,  nor  even  necessarily  to  output 
quality. 

• Despite  half  a century  of  research,  MT  is  far  from  solved.  Human 
language  is  a rich  and  fascinating  area  whose  treasures  have  only  begun 
to be explored.

Bibliographical and Historical Notes

Work on models of the process and goals of translation goes back at least to
Saint  Jerome  in  the  fourth  century  (Kelley,  1979).  The  development  of  log- 
ical languages,  free  of  the  imperfections  of  human  languages,  for  reasoning 
correctly  and  for  communicating  truths  and  thereby  also  for  translation,  has 
been  pursued  at  least  since  the  1600s  (Hutchins,  1986). 

By  the  late  1940s,  scant  years  after  the  birth  of  the  electronic  computer, 
the  idea  of  MT  was  raised  seriously  (Weaver,  1955a).  In  1954  the  first  public 
demonstration  of  a MT  system  prototype  (Dostert,  1955)  led  to  great  excite- 
ment in  the  press  (Hutchins,  1997).  The  next  decade  saw  a great  flowering  of 
ideas,  prefiguring  most  subsequent  developments.  But  this  work  was  ahead 
of  its  time  — implementations  were  limited  by,  for  example,  the  fact  that 
pending  the  development  of  disks  there  was  no  good  way  to  store  dictionary 
information. 

As  high  quality  MT  proved  elusive  (Bar-Hillel,  1960),  a growing  con- 
sensus on  the  need  for  more  basic  research  in  the  new  fields  of  formal  and 
computational  linguistics  led  in  the  mid  1960s  to  a dramatic  cut  in  fund- 
ing for  MT  research.  As  MT  research  lost  academic  respectability,  the  As- 
sociation for  Machine  Translation  and  Computational  Linguistics  dropped 
MT  from  its  name.  Some  MT  developers,  however,  persevered,  slowly 
and  steadily  improving  their  systems,  and  slowly  garnering  more  customers. 
Systran  in  particular,  developed  initially  by  Peter  Toma,  has  been  contin- 
uously improved  over  40  years.  Its  earliest  uses  were  for  information  ac- 
quisition, for  example  by  the  US  Air  Force  for  Russian  documents;  and  in 
1976  an  English-French  edition  was  adopted  by  the  European  Community 
for  creating  rough  and  post-editable  translations  of  various  administrative 
documents.  Our  translation  example  in  the  introduction  was  produced  using 
the  free  Babelfish  version  of  Systran  on  the  Web.  Another  early  successful 
MT  system  was  Meteo,  which  translated  weather  forecasts  from  English  to 
French;  incidentally,  its  original  implementation  (1976),  used  “Q-systems”, 
an  early  unification  model. 

The  late  1970s  saw  the  birth  of  another  wave  of  academic  interest  in 
MT. One source of excitement was the possibility of using Artificial Intelligence
ideas, originally developed for story understanding and
knowledge engineering (Carbonell et al., 1981). This interest in meaning-based
techniques was also a reaction to the dominance of syntax in computational
linguistics at that time. Another motivation for the use of interlingual
models  was  their  introspective  plausibility:  the  idea  that  MT  systems  should 
translate  as  people  do  (presuming  that  people  translate  by  using  their  abil- 
ity to  understand).  Introspection  here  may  be  misleading,  since  the  process 
of  human  translation  is  enormously  complex  and  furthermore  the  relevance 
for  machine  translation  is  unclear.  Concerns  about  such  issues  were  much 
discussed in the late 1980s and early 1990s (Tsujii, 1986; Nirenburg et al.,
1992; Ward, 1994; Carbonell et al., 1992). Meanwhile MT usage was
increasing,  fueled  by  the  increase  in  international  trade  and  the  growth  of 
governments  with  policies  requiring  the  translation  of  all  documents  into 
multiple  official  languages,  and  enabled  by  the  proliferation  of  word  proces- 
sors, and  then  personal  computers,  and  then  the  World  Wide  Web. 

The  1990s  saw  the  application  of  statistical  methods,  enabled  by  the 
development of large corpora. Excitement was provided by the “grand challenge”
of building speech-to-speech translation systems (Kay et al., 1992;
Bub et al., 1997; Frederking et al., to appear) where MT catches up with the modern
vision of computers being embedded, ubiquitous and interactive. On the
practical side, with the growth of the user population, users’ needs have had
an  increasing  effect  on  priorities  for  MT  research  and  development. 

Good surveys of the early history of MT are Hutchins (1986) and
Hutchins (1997). The textbook by Hutchins and Somers (1992) includes a wealth
of  examples  of  language  phenomena  that  make  translation  difficult,  and  ex- 
tensive descriptions  of  some  historically  significant  MT  systems. 

Academic papers on machine translation appear in the journal Machine
Translation and in the proceedings of the biennial (odd years) Conferences
on Theoretical and Methodological Issues in Machine Translation.

Reports  on  systems,  markets,  and  user  experiences  can  be  found  in  MT 
News  International,  the  newsletter  of  the  International  Association  for  Ma- 
chine Translation,  which  is  the  umbrella  organization  for  the  three  regional 
MT societies: the Association for MT in the Americas, the Asia-Pacific
Association  for  MT,  and  the  European  Association  for  MT.  These  societies 
have  annual  meetings  which  bring  together  developers  and  users.  The  pro- 
ceedings of  the  biennial  MT  Summit  (odd  years)  are  also  often  published. 
The mainstream computational linguistics journals and conferences also occasionally
report work in machine translation.

Exercises

21.1  Select  at  random  a paragraph  of  Chapter  9 which  describes  a fact 
about  English  syntax,  a)  Describe  and  illustrate  how  your  favorite  foreign 
language  differs  in  this  respect,  b)  Explain  how  a MT  system  could  deal 
with  this  difference. 

21.2  Go  to  the  literature  section  of  the  library,  and  find  a foreign  language 
novel  in  a language  you  know.  Copy  down  the  shortest  sentence  on  the  first 
page.  Now  look  up  the  rendition  of  that  sentence  in  an  English  translation  of 
the  novel,  a)  For  both  original  and  translation,  draw  parse  trees,  b)  For  both 
original  and  translation,  draw  dependency  structures,  c)  Draw  a case  struc- 
ture representation  of  the  meaning  which  the  original  and  translation  share, 
d)  What  does  this  exercise  suggest  to  you  regarding  intermediate  representa- 
tions for  MT? 

21.3  Pick  a word  from  the  first  sentence  of  the  top  article  of  today’s  news- 
paper. a)  List  the  possible  equivalents  found  in  a bilingual  dictionary  b) 
Sketch  out  how  a MT  system  could  choose  the  appropriate  translation  to  use 
based  on  the  context  of  occurrence,  c)  Sketch  out  how  this  could  be  done 
without  using  contrastive  knowledge. 

21.4 The idea of example-based MT can be extended to “translation by
analogy” (Sato and Nagao, 1990). a) Given the bilingual data in Figure 21.11,
what Japanese word do you think would be appropriate as a translation of on
in research on gastropods? b) Specify an algorithm for doing lexical transfer
in  this  way.  c)  How  is  your  approach  similar  to  choice  of  TL  words  by  using 
a TL  language  model  (Section  21.5)?  d)  How  is  it  similar  to  disambiguation 
using  semantic  features  as  in  Chapter  16? 


English phrase                    Japanese translation of on

the cat on the mat                no ue no
more notes on decision making     ni tsuite
pink frosting on the cake         no
see boats on the pond             no, ni
always reading on the bus         de

Figure 21.11 A mini-corpus of made-up phrases involving on and their
Japanese translations

21.5 Type a sentence into a MT system (perhaps a free demo on the Web)
and see what it outputs, a) List the problems with the translation, b) Rank
these  problems  in  order  of  severity,  c)  For  the  two  most  severe  problems, 
suggest  the  probable  root  cause. 

21.6 Since natural languages are hard to deal with, due to ambiguities,
irregularities, and other complexities, it is much nicer to work with something
which is more logical: something that does not have these ‘flaws’ of natural
language. As a result, various notations which are (in some ways) less
ambiguous or more regular than English have been proposed. In addition
to various meaning representation schemes, natural languages such as Esperanto
and Sanskrit have also been proposed for use as interlinguas for
machine  translation.  Is  this  a good  idea?  Why  or  why  not? 

21.7  Consider  the  types  of  ‘understanding’  needed:  1.  for  a natural  lan- 
guage interface  to  a database,  as  seen  in  Chapter  15.  2.  for  an  information 
extraction  program,  as  seen  in  Chapter  15.  3.  for  a MT  system.  Which  of 
these  requires  a deeper  understanding?  In  what  way? 

21.8  Choose  one  of  the  generation  techniques  introduced  in  Chapter  20 
and  explain  why  it  would  or  would  not  be  useful  for  MT. 

21.9  Version  1 (for  native  English  speakers):  Consider  the  following  sen- 
tence: 


These lies are like their father that begets them; gross as a mountain,
open, palpable.

Henry IV, Part 1, act 2, scene 2

Translate  this  sentence  into  some  dialect  of  modern  vernacular  English. 
For  example,  you  might  translate  it  into  the  style  of  a New  York  Times  edi- 
torial or  an  Economist  opinion  piece,  or  into  the  style  of  your  favorite  tele- 
vision talk-show  host. 

Version  2 (for  native  speakers  of  other  languages):  Translate  the  fol- 
lowing sentence  into  your  native  language. 

One  night  my  friend  Tom,  who  had  just  moved  into  a new  apartment, 
saw  a cockroach  scurrying  about  in  the  kitchen. 

For  either  version,  now: 

a)  Describe  how  you  did  the  translation:  What  steps  did  you  perform? 
In  what  order  did  you  do  them?  Which  steps  took  the  most  time?  b)  Could 
you  write  a program  that  would  translate  using  the  same  methods  that  you 
did? Why or why not? c) What aspects were hardest for you? Would they
be hard for a MT system? d) What aspects would be hardest for a MT system?
Are they hard for people too? e) Which models are best for describing
various  aspects  of  your  process  (direct,  transfer,  interlingua  or  statistical)?  f) 
Now  compare  your  translation  with  those  produced  by  friends  or  classmates. 
What  is  different?  Why  were  the  translations  different? 

21.10 Newspaper reports of MT systems invariably include an example of
a sentence, typically a proverb, that when translated from English to
language X, and then back to English, came out funny.

a) Is this evidence that at least one of the two MT systems was bad?
b) Why does this problem not arise with human translators? Or does it?
c) On the other hand, does a successful translation to a foreign language
   and back indicate that the system is doing well?

21.11 Set yourself an information acquisition task: for example, to find a
World-Wide Web page in your favorite foreign language reviewing a recent
movie, and discover what the reviewer thought. Accomplish this task using
one or two of the Web’s machine translation providers.

a) Give two examples each of correct and incorrect translations you
   encountered.
b) Come up with a simple quality metric for rating the MT output, and use
   it to evaluate the MT systems you tried.
c) Were you able to find a page of the kind you wanted?
d) Were you able to figure out whether the reviewer liked the movie?
e) Were the scores on your quality metric predictive of your answers
   to (c) and (d)?

21.12 Consider each of the following as an application for machine
translation. Rank the difficulty of each from 1 (easy) to 4 (very very
hard). Also, for each task, say briefly what makes it easy or hard.

a.  letters  between  an  American  girl  and  her  Chinese  pen-pal 

b.  electronic  junk  mail 

c.  articles  in  chemistry  journals 

d.  magazine  advertisements 

e.  children’s  storybooks 

f.  history  books 

g.  an  English-speaker  wanting  to  read  articles  in  Japanese  newsgroups 

h.  an  English-speaker  wanting  to  post  articles  to  a Japanese  newsgroup 


Perl         grep         MS Word      Description

Single character expressions
\...         \...         \...         a special character
.            .            ?            any single character
[...]        [...]        [...]        any single character listed
[...-...]    [...-...]    [...-...]    any single character in the range
[^...]       [^...]       [!...]       any single character not listed
[^...-...]   [^...-...]   [!...-...]   any single character not in the range

Anchors/Expressions which match positions
^            ^            ^            beginning of line
$            $            $            end of line
\b           -            -            word boundary
\B           -            -            word non-boundary
-            \<           <            start of word
-            \>           >            end of word

Counters/Expressions which quantify previous expressions
*            *            -            zero or more of previous r.e.
+            -            @            one or more of previous r.e.
?            -            -            exactly one or zero of previous r.e.
{n}          \{n\}        {n}          n of previous r.e.
{n,m}        \{n,m\}      {n,m}        from n to m of previous r.e.
{n,}         \{n,\}       {n,}         at least n of previous r.e.

Figure A.1 Basic regular expressions
Perl         grep               MS Word            Description

Other
.*           .*                 *                  any string of characters
...|...      -                  -                  or - matches either r.e.
(...)        \(...\)            (...)              grouping, memory

Shortcuts
\d           [0-9]              [0-9]              any digit
\D           [^0-9]             [^0-9]             any non-digit
\w           [a-zA-Z0-9_]       [a-zA-Z0-9_]       any alphanumeric or underscore
\W           [^a-zA-Z0-9_]      [^a-zA-Z0-9_]      any non-alphanumeric
\s           [ \r\t\n\f]        -                  whitespace (space, tab)
\S           [^ \r\t\n\f]       -                  non-whitespace

Figure A.2 More regular expressions
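
The operators in Figures A.1 and A.2 can be tried out directly. The following is a small sketch using Python's `re` module, whose syntax largely follows the Perl column; the sample strings are invented for illustration and are not from the text.

```python
import re

text = "Tel. 555-1234, ext. 42; see page 7."

# \d+ : one or more digits (a shortcut combined with a counter)
digits = re.findall(r"\d+", text)

# \b\w{4}\b : exactly four word characters between word boundaries
four_chars = re.findall(r"\b\w{4}\b", text)

# colou?r : the ? makes the preceding u optional
spellings = re.findall(r"colou?r", "color colour colouur")

# ^ and $ anchor the match to the beginning and end of the line
is_number = re.search(r"^\d{3}-\d{4}$", "555-1234") is not None

# (...) groups and remembers; \1 refers back to the first group
doubled = re.findall(r"\b(\w+) \1\b", "the the cat sat on on the mat")

# \s+ : any run of whitespace, squeezed here to a single space
squeezed = re.sub(r"\s+", " ", "too   many\t\nspaces")
```

Each pattern is written as a raw string (`r"..."`) so that backslashes reach the regular expression engine unchanged.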

THE  PORTER  STEMMING 
ALGORITHM 


For the purposes of the Porter (1980) algorithm we define a consonant as a
letter other than A, E, I, O, and U, and other than Y preceded by a
consonant. Any other letter is a vowel. (This is of course just an
orthographic approximation.) Let c denote a consonant and v denote a vowel.
C will stand for a string of one or more consonants, and V for a string of
one or more vowels. Any written English word or word part can be
represented by the following regular expression (where the parentheses ()
are used to mark optional elements):

(C)(VC)^m(V)

For example the word troubles maps to the following sequence:

tr  ou  bl  e  s
C   V   C   V  C

with no final V. We call the Kleene operator m the measure of any word or
word part; the measure correlates very roughly with the number of syllables
in the word or word part. Some examples:


m=0    TR, EE, TREE, Y, BY
m=1    TROUBLE, OATS, TREES, IVY
m=2    TROUBLES, PRIVATE, OATEN, ORRERY
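
The measure can be computed mechanically from the definitions above. The following is a minimal sketch in Python; the function names `is_consonant` and `measure` are chosen here for illustration and are not part of Porter's description.

```python
import re


def is_consonant(word, i):
    """True if word[i] is a consonant under the orthographic definition:
    not A, E, I, O, U, and not a Y that is preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when the previous letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Return m, the number of VC sequences in the form (C)(VC)^m(V)."""
    word = word.lower()
    letters = "".join("c" if is_consonant(word, i) else "v"
                      for i in range(len(word)))
    # Collapse runs: one or more consonants -> C, one or more vowels -> V.
    collapsed = re.sub(r"v+", "V", re.sub(r"c+", "C", letters))
    return collapsed.count("VC")
```

For instance, `measure("troubles")` reduces the word to the sequence CVCVC and counts two VC pairs.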


The rules that we will present below will all be in the following format:

(condition) S1 → S2

meaning “if a word ends with the suffix S1, and the stem before S1
satisfies the condition, S1 is replaced by S2”. Conditions include the
following and any boolean combinations of them:

m      the measure of the stem
*S     the stem ends with S (and similarly for other letters)
*v*    the stem contains a vowel
*d     the stem ends with a double consonant (e.g. -TT, -SS)
*o     the stem ends CVC, where the second C is not W, X, or Y
       (e.g. -WIL, -HOP)

The Porter algorithm consists of seven simple sets of rules, applied in
order. Within each step, if more than one of the rules can apply, only the
one with the longest matching suffix (S1) is followed.

Step  1:  Plural  Nouns  and  Third  Person  Singular  Verbs 

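The rules in this set do not have conditions:

SSES → SS     caresses → caress
IES → I       ponies → poni
              ties → ti
SS → SS       caress → caress
S → ε         cats → cat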


Step 2a: Verbal Past Tense and Progressive Forms

(m > 0) EED → EE     feed → feed
                     agreed → agree
(*v*) ED → ε         plastered → plaster
                     bled → bled
(*v*) ING → ε        motoring → motor
                     sing → sing

Step 2b: Cleanup

If the second or third of the rules in 2a is successful, we run the
following rules (that remove double letters and put the E back on
-ATE/-BLE):

AT → ATE                                    conflat(ed) → conflate
BL → BLE                                    troubl(ing) → trouble
IZ → IZE                                    siz(ed) → size
(*d & !(*L or *S or *Z)) → single letter    hopp(ing) → hop
                                            tann(ed) → tan
                                            fall(ing) → fall
                                            hiss(ing) → hiss
                                            fizz(ed) → fizz
(m=1 & *o) → E                              fail(ing) → fail
                                            fil(ing) → file

Step 3: Y → I

(*v*) Y → I     happy → happi
                sky → sky


Step 4: Derivational Morphology I: Multiple suffixes

(m > 0) ATIONAL → ATE     relational → relate
(m > 0) TIONAL → TION     conditional → condition
                          rational → rational
(m > 0) ENCI → ENCE       valenci → valence
(m > 0) ANCI → ANCE       hesitanci → hesitance
(m > 0) IZER → IZE        digitizer → digitize
(m > 0) ABLI → ABLE       conformabli → conformable
(m > 0) ALLI → AL         radicalli → radical
(m > 0) ENTLI → ENT       differentli → different
(m > 0) ELI → E           vileli → vile
(m > 0) OUSLI → OUS       analogousli → analogous
(m > 0) IZATION → IZE     vietnamization → vietnamize
(m > 0) ATION → ATE       predication → predicate
(m > 0) ATOR → ATE        operator → operate
(m > 0) ALISM → AL        feudalism → feudal
(m > 0) IVENESS → IVE     decisiveness → decisive
(m > 0) FULNESS → FUL     hopefulness → hopeful
(m > 0) OUSNESS → OUS     callousness → callous
(m > 0) ALITI → AL        formaliti → formal
(m > 0) IVITI → IVE       sensitiviti → sensitive
(m > 0) BILITI → BLE      sensibiliti → sensible
Step 5: Derivational Morphology II: More multiple suffixes

(m > 0) ICATE → IC     triplicate → triplic
(m > 0) ATIVE → ε      formative → form
(m > 0) ALIZE → AL     formalize → formal
(m > 0) ICITI → IC     electriciti → electric
(m > 0) FUL → ε        hopeful → hope
(m > 0) NESS → ε       goodness → good
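
A single step of this kind can be sketched in a few lines of Python. The helper names (`measure`, `apply_rules`, `STEP5_RULES`) are hypothetical, and the code assumes the reading of the longest-match convention in which the rule with the longest matching suffix is selected first and the word is left unchanged if that rule's condition fails.

```python
import re


def measure(stem):
    """Porter measure m: number of VC sequences in (C)(VC)^m(V)."""
    def is_consonant(i):
        if stem[i] in "aeiou":
            return False
        if stem[i] == "y":
            return i == 0 or not is_consonant(i - 1)
        return True
    letters = "".join("c" if is_consonant(i) else "v"
                      for i in range(len(stem)))
    return re.sub(r"v+", "V", re.sub(r"c+", "C", letters)).count("VC")


# (suffix S1, replacement S2); every Step 5 rule carries the condition m > 0.
STEP5_RULES = [("icate", "ic"), ("ative", ""), ("alize", "al"),
               ("iciti", "ic"), ("ful", ""), ("ness", "")]


def apply_rules(word, rules, min_measure=0):
    """Apply the rule with the longest matching suffix, if its stem
    satisfies m > min_measure; otherwise return the word unchanged."""
    for suffix, replacement in sorted(rules, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word  # longest match found, but condition failed
    return word
```

Under these assumptions `apply_rules("formalize", STEP5_RULES)` strips ALIZE, checks that the stem `form` has a measure above zero, and appends AL.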

Step 6: Derivational Morphology III: single suffixes

(m > 1) AL → ε                      revival → reviv
(m > 1) ANCE → ε                    allowance → allow
(m > 1) ENCE → ε                    inference → infer
(m > 1) ER → ε                      airliner → airlin
(m > 1) IC → ε                      gyroscopic → gyroscop
(m > 1) ABLE → ε                    adjustable → adjust
(m > 1) IBLE → ε                    defensible → defens
(m > 1) ANT → ε                     irritant → irrit
(m > 1) EMENT → ε                   replacement → replac
(m > 1) MENT → ε                    adjustment → adjust
(m > 1) ENT → ε                     dependent → depend
(m > 1 & (*S or *T)) ION → ε        adoption → adopt
(m > 1) OU → ε                      homologou → homolog
(m > 1) ISM → ε                     communism → commun
(m > 1) ATE → ε                     activate → activ
(m > 1) ITI → ε                     angulariti → angular
(m > 1) OUS → ε                     homologous → homolog
(m > 1) IVE → ε                     effective → effect
(m > 1) IZE → ε                     bowdlerize → bowdler

Step 7a: Cleanup

(m > 1) E → ε               probate → probat
                            rate → rate
(m = 1 & !*o) E → ε         cease → ceas

Step 7b: Cleanup

(m > 1 & *d & *L) → single letter     controll → control
                                      roll → roll


C5 AND C7 TAGSETS


Tag    Description                       Example

AJ0    adjective (unmarked)              good, old
AJC    comparative adjective             better, older
AJS    superlative adjective             best, oldest
AT0    article                           the, a, an
AV0    adverb (unmarked)                 often, well, longer, furthest
AVP    adverb particle                   up, off, out
AVQ    wh-adverb                         when, how, why
CJC    coordinating conjunction          and, or
CJS    subordinating conjunction         although, when
CJT    the conjunction that
CRD    cardinal numeral (except one)     3, twenty-five, 734
DPS    possessive determiner             your, their
DT0    general determiner                these, some
DTQ    wh-determiner                     whose, which
EX0    existential there
ITJ    interjection or other isolate     oh, yes, mhm
NN0    noun (neutral for number)         aircraft, data
NN1    singular noun                     pencil, goose
NN2    plural noun                       pencils, geese
NP0    proper noun                       London, Michael, Mars
ORD    ordinal                           sixth, 77th, last
PNI    indefinite pronoun                none, everything
PNP    personal pronoun                  you, them, ours
PNQ    wh-pronoun                        who, whoever

Figure C.1 First half of UCREL C5 Tagset for the British National Corpus
(BNC) after Garside et al. (1997).
Tag    Description                                 Example

PNX    reflexive pronoun                           itself, ourselves
POS    possessive ’s or ’
PRF    the preposition of
PRP    preposition (except of)                     for, above, to
PUL    punctuation - left bracket                  ( or [
PUN    punctuation - general mark                  . ! , : ; - ? ...
PUQ    punctuation - quotation mark                ‘ ’ ”
PUR    punctuation - right bracket                 ) or ]
TO0    infinitive marker to
UNC    unclassified items (not English)
VBB    base forms of be (except infinitive)        am, are
VBD    past form of be                             was, were
VBG    -ing form of be                             being
VBI    infinitive of be
VBN    past participle of be                       been
VBZ    -s form of be                               is, ’s
VDB    base form of do (except infinitive)         do
VDD    past form of do                             did
VDG    -ing form of do                             doing
VDI    infinitive of do                            to do
VDN    past participle of do                       done
VDZ    -s form of do                               does
VHB    base form of have (except infinitive)       have
VHD    past tense form of have                     had, ’d
VHG    -ing form of have                           having
VHI    infinitive of have
VHN    past participle of have                     had
VHZ    -s form of have                             has, ’s
VM0    modal auxiliary verb                        can, could, will, ’ll
VVB    base form of lexical verb (except infin.)   take, live
VVD    past tense form of lexical verb             took, lived
VVG    -ing form of lexical verb                   taking, living
VVI    infinitive of lexical verb                  take, live
VVN    past participle form of lex. verb           taken, lived
VVZ    -s form of lexical verb                     takes, lives
XX0    the negative not or n’t
ZZ0    alphabetical symbol                         A, B, c, d

Figure C.2 The rest of UCREL’s C5 Tagset (Garside et al., 1997).

Tag      Description                           Example

!        punctuation tag - exclamation mark
"        punctuation tag - quotation marks
(        punctuation tag - left bracket
)        punctuation tag - right bracket
,        punctuation tag - comma
-        punctuation tag - dash
-----    new sentence marker
.        punctuation tag - full-stop
...      punctuation tag - ellipsis
:        punctuation tag - colon
;        punctuation tag - semi-colon
?        punctuation tag - question-mark
APPGE    possessive pronoun, prenominal        my, your, our etc.
AT       article                               the, no
AT1      singular article                      a, an, every
BCL      before-clause marker                  in order [that]
CC       coordinating conjunction              and, or
CCB      coordinating conjunction              but
CS       subordinating conjunction             if, because, unless
CSA      as as a conjunction
CSN      than as a conjunction
CST      that as a conjunction
CSW      whether as a conjunction
DA       post-determiner/pronoun               such, former, same
DA1      singular after-determiner             little, much
DA2      plural after-determiner               few, several, many
DAR      comparative after-determiner          more, less
DAT      superlative after-determiner          most, least
DB       pre-determiner/pronoun                all, half
DB2      plural pre-determiner/pronoun         both
DD       determiner/pronoun                    any, some
DD1      singular determiner                   this, that, another
DD2      plural determiner                     these, those
DDQ      wh-determiner                         which, what
DDQGE    wh-determiner, genitive               whose
DDQV     wh-ever determiner                    whichever, whatever
EX       existential there
FO       formula
FU       unclassified

Figure C.3 First part of UCREL C7 Tagset for the British National Corpus
(BNC) from (Garside et al., 1997).
Tag     Description                                 Example

FW      foreign word
GE      germanic genitive marker                    ’ or ’s
IF      for as a preposition
II      preposition                                 in, on, to
IO      of as a preposition
IW      with; without as preposition
JJ      general adjective                           big, old
JJR     general comparative adjective               older, better, bigger
JJT     general superlative adjective               oldest, best, biggest
JK      adjective catenative                        able in be able to
                                                    willing in be willing to
MC      cardinal number (neutral for number)        two, three...
MC1     singular cardinal number                    one
MC2     plural cardinal number                      tens, twenties
MCMC    hyphenated number                           40-50, 1770-1827
MD      ordinal number                              first, 2nd, next, last
ND1     singular noun of direction                  north, southeast
NN      common noun (neutral for number)            sheep, cod
NN1     singular common noun                        book, girl
NN2     plural common noun                          books, girls
NNA     following noun of title                     M.A.
NNB     preceding noun of title                     Mr, Prof
NNL1    singular locative noun                      street, Bay
NNL2    plural locative noun                        islands, roads
NNO     numeral noun (neutral for number)           dozen, thousand
NNO2    plural numeral noun                         hundreds, thousands
NNT     temporal noun (neutral for number)          no known examples
NNT1    singular temporal noun                      day, week, year
NNT2    plural temporal noun                        days, weeks, years
NNU     unit of measurement (neutral for number)    in., cc.
NNU1    singular unit of measurement                inch, centimetre
NNU2    plural unit of measurement                  inches, centimetres
NP      proper noun (neutral for number)            Philippines, Mercedes
NP1     singular proper noun                        London, Jane, Frederick
NP2     plural proper noun                          Browns, Reagans, Koreas
NPD1    singular weekday noun                       Sunday
NPD2    plural weekday noun                         Sundays

Figure C.4 More of UCREL’s C7 Tagset (Garside et al., 1997).

Tag      Description                                Example

NPM1     singular month noun                        October
NPM2     plural month noun                          Octobers
PN       indefinite pronoun (neutral for number)    none
PN1      singular indefinite pronoun                one, everything, nobody
PNQO     whom
PNQS     who
PNQV     whoever, whomever, whomsoever, whosoever
PNX1     reflexive indefinite pronoun               oneself
PPGE     nominal possessive personal pronoun        mine, yours
PPH1     it
PPHO1    him, her
PPHO2    them
PPHS1    he, she
PPHS2    they
PPIO1    me
PPIO2    us
PPIS1    I
PPIS2    we
PPX1     singular reflexive personal pronoun        yourself, itself
PPX2     plural reflexive personal pronoun          yourselves, ourselves
PPY      you
RA       adverb, after nominal head                 else, galore
REX      adverb introducing appositional            namely, viz, eg.
         constructions
RG       degree adverb                              very, so, too
RGQ      wh- degree adverb                          how
RGQV     wh-ever degree adverb                      however
RGR      comparative degree adverb                  more, less
RGT      superlative degree adverb                  most, least
RL       locative adverb                            alongside, forward
RP       prepositional adverb; particle             in, up, about
RPK      prepositional adverb, catenative           about in be about to
RR       general adverb                             actually
RRQ      wh- general adverb                         where, when, why, how
RRQV     wh-ever general adverb                     wherever, whenever
RRR      comparative general adverb                 better, longer
RRT      superlative general adverb                 best, longest
RT       nominal adverb of time                     now, tomorrow

Figure C.5 More of UCREL’s C7 Tagset (Garside et al., 1997).
Tag     Description                             Example

TO      infinitive marker                       to
UH      interjection                            oh, yes, um
VB0     be
VBDR    were
VBDZ    was
VBG     being
VBI     infinitive be
VBM     am
VBN     been
VBR     are
VBZ     is
VD0     do
VDD     did
VDG     doing
VDI     infinitive do
VDN     done
VDZ     does
VH0     have
VHD     past tense had
VHG     having
VHI     infinitive have
VHN     past participle had
VHZ     has
VM      modal auxiliary                         can, will, would etc.
VMK     modal catenative                        ought, used
VV0     base form of lexical verb               give, work etc.
VVD     past tense form of lexical verb         gave, worked etc.
VVG     -ing form of lexical verb               giving, working etc.
VVGK    -ing form in a catenative verb          going in be going to
VVI     infinitive of lexical verb              [to] give, [to] work etc.
VVN     past participle form of lexical verb    given, worked etc.
VVNK    past part. in a catenative verb         bound in be bound to
VVZ     -s form of lexical verb                 gives, works etc.
XX      not, n’t
ZZ1     singular letter of the alphabet         A, a, B, etc.
ZZ2     plural letter of the alphabet           As, b’s, etc.

Figure C.6 The rest of UCREL’s C7 Tagset (Garside et al., 1997).

TRAINING  HMMS:  THE 
FORWARD-BACKWARD 
ALGORITHM 


This appendix sketches the forward-backward or Baum-Welch algorithm
(Baum, 1972), a special case of the Expectation-Maximization or EM
algorithm (Dempster et al., 1977). The algorithm will let us train the
transition probabilities $a_{ij}$ and the emission probabilities $b_j(o_t)$
of the HMM. While it is theoretically possible to train both the network
structure of an HMM and these probabilities, no good algorithm for this
double-induction exists. Thus in practice the structure of most HMMs is
designed by hand, and then the transition and emission probabilities are
trained from a large set of observation sequences $O$. Furthermore, it
turns out that the problem of setting the $a$ and $b$ parameters so as to
exactly maximize the probability of the observation sequence $O$ is
unsolved. The algorithm that we give in this section is only guaranteed to
find a local maximum. The forward-backward algorithm is used throughout
speech and language processing, for example in training HMM-based
part-of-speech taggers, as we saw in Chapter 8. Extensions of
forward-backward are also important, like the Inside-Outside algorithm used
to train stochastic context-free grammars (Chapter 12).

Let us begin by imagining that we were training not a Hidden Markov
Model but a vanilla Markov Model. We do this by running the model on the
observation and seeing which transitions and observations were used. For
ease of description in the rest of this section, we will pretend that we are
training on a single sequence of training data (called $O$), but of course in
a real speech recognition system we would train on hundreds of thousands
of sequences (thousands of sentences). Since unlike an HMM, a vanilla
Markov Model is not hidden, we can look at an observation sequence and
know exactly which transitions we took through the model, and which state
generated each observation symbol. Since every state can only generate one
observation symbol, the observation $b$ probabilities are all 1.0. The
probability $a_{ij}$ of a particular transition between states $i$ and $j$
can be computed by counting the number of times the transition was taken,
which we could call $C(i \to j)$, and then normalizing by the total count of
all times we took any transition from state $i$.

\[ a_{ij} = \frac{C(i \to j)}{\sum_{q \in Q} C(i \to q)} \tag{D.1} \]
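
For a visible Markov model, (D.1) amounts to counting and normalizing. The following is a minimal sketch in Python, assuming the training data is given as a single list of state labels; the function name `estimate_transitions` is chosen here for illustration.

```python
from collections import Counter, defaultdict


def estimate_transitions(states):
    """Maximum likelihood transition estimates from one fully observed
    state sequence: a[i][j] = C(i -> j) / sum over q of C(i -> q)."""
    counts = defaultdict(Counter)
    for i, j in zip(states, states[1:]):
        counts[i][j] += 1  # one more observed transition i -> j
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }
```

Transitions that never occur in the data are simply absent from the result, which corresponds to an estimated probability of zero.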


For an HMM we cannot compute these counts directly from an observed
sentence (or set of sentences), since we don’t know which path of
states was taken through the machine for a given input. The Baum-Welch
algorithm uses two neat intuitions to solve this problem. The first idea is
to iteratively estimate the counts. We will start with an estimate for the
transition and observation probabilities, and then use these estimated
probabilities to derive better and better probabilities. The second idea is
that we get our estimated probabilities by computing the forward
probability for an observation and then dividing that probability mass
among all the different paths that contributed to this forward probability.

In order to understand the algorithm, we need to return to the forward
algorithm of Chapter 5 and more formally define two related probabilities
which will be used in computing the final probability: the forward
probability and the backward probability. We refer to the forward
probability as $\alpha$ and the backward probability as $\beta$. Recall that
we defined the forward probability as the probability of being in state $i$
after seeing the first $t$ observations, given the automaton $\lambda$:

\[ \alpha_i(t) = P(o_1, o_2 \ldots o_t, q_t = i \mid \lambda) \tag{D.2} \]

In Chapter 5 we used a matrix to calculate the forward probability
recursively; now we will formally define the actual recursion.


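1. Initialization:

\[ \alpha_j(1) = a_{1j}\, b_j(o_1), \quad 1 < j < N \tag{D.3} \]

2. Recursion (since states 1 and $N$ are non-emitting):

\[ \alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t), \quad 1 < j < N,\ 1 < t \le T \tag{D.4} \]

3. Termination:

\[ P(O \mid \lambda) = \alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN} \tag{D.5} \]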

As we saw in Chapter 5, the forward probability is computed via a
matrix or lattice, in which each column is computed by extending the paths
from the previous columns. Figure D.1 illustrates the induction step for
computing the value in one new cell.

Figure D.1 The computation of $\alpha_j(t)$ by summing all the previous
values $\alpha_i(t-1)$ weighted by their transition probabilities $a_{ij}$
and multiplying by the observation probability $b_j(o_t)$. Of course in any
given HMM many or most of the transition probabilities will be 0, so not all
previous states will contribute to the forward probability of the current
state.
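
The forward recursion can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a reference implementation: states are indexed from 0, so state 0 plays the role of the non-emitting start state 1 and state N-1 the role of the non-emitting end state N; `a` is a list of lists of transition probabilities and `b` a list of dictionaries mapping symbols to emission probabilities. All names are chosen here for illustration.

```python
def forward(a, b, obs):
    """Return (alpha, P(O | lambda)), where alpha[t][j] is the forward
    probability of emitting state j after the first t+1 observations."""
    N, T = len(a), len(obs)
    emitting = range(1, N - 1)
    alpha = [[0.0] * N for _ in range(T)]
    # Initialization: leave the start state and emit the first symbol.
    for j in emitting:
        alpha[0][j] = a[0][j] * b[j][obs[0]]
    # Recursion: extend every path from the previous column.
    for t in range(1, T):
        for j in emitting:
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j]
                              for i in emitting) * b[j][obs[t]]
    # Termination: move from each emitting state into the end state.
    total = sum(alpha[T - 1][i] * a[i][N - 1] for i in emitting)
    return alpha, total
```

In practice the products underflow for long sequences, so real systems work with scaled or log probabilities; that refinement is left out here to keep the correspondence with the three steps visible.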

The second important piece of the forward-backward algorithm, the
backward probability, is almost the mirror image of the forward probability;
it computes the probability of seeing the observations from time $t+1$ to
the end, given that we are in state $j$ at time $t$ (and of course given the
automaton $\lambda$):

\[ \beta_j(t) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = j, \lambda) \tag{D.6} \]

It is computed inductively in a similar manner to the forward algorithm.

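1. Initialization:

\[ \beta_i(T) = a_{iN}, \quad 1 < i < N \tag{D.7} \]

2. Recursion (again since states 1 and $N$ are non-emitting):

\[ \beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1), \quad 1 < i < N,\ T > t \ge 1 \tag{D.8} \]

3. Termination:

\[ P(O \mid \lambda) = \alpha_N(T) = \beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1) \tag{D.9} \]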
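
The backward pass mirrors the forward one. The sketch below uses the same illustrative conventions, which are assumptions rather than anything fixed by the text: 0-indexed states, state 0 as the non-emitting start state, state N-1 as the non-emitting end state, and `b` as a list of symbol-to-probability dictionaries.

```python
def backward(a, b, obs):
    """Return (beta, P(O | lambda)), where beta[t][i] is the probability
    of the observations after position t given emitting state i there."""
    N, T = len(a), len(obs)
    emitting = range(1, N - 1)
    beta = [[0.0] * N for _ in range(T)]
    # Initialization: from the last observation, only the end state remains.
    for i in emitting:
        beta[T - 1][i] = a[i][N - 1]
    # Recursion: work backwards, emitting the next symbol from state j.
    for t in range(T - 2, -1, -1):
        for i in emitting:
            beta[t][i] = sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                             for j in emitting)
    # Termination: leave the start state and emit the first symbol.
    total = sum(a[0][j] * b[j][obs[0]] * beta[0][j] for j in emitting)
    return beta, total
```

The total returned here should agree with the total from the forward pass on the same model and observation sequence, which makes a convenient sanity check.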
Figure D.2 illustrates the backward induction step.

We are now ready to understand how the forward and backward
probabilities can help us compute the transition probability $a_{ij}$ and
observation probability $b_i(o_t)$ from an observation sequence, even though
the actual path taken through the machine is hidden!

Let’s begin by showing how to reestimate $a_{ij}$. We will proceed to
estimate $\hat{a}_{ij}$ by a variant of (D.1):

\[ \hat{a}_{ij} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i} \tag{D.10} \]

How do we compute the numerator? Here’s the intuition. Assume
we had some estimate of the probability that a given transition $i \to j$
was taken at a particular point in time $t$ in the observation sequence. If
we knew this probability for each particular time $t$, we could sum over all
times $t$ to estimate the total count for the transition $i \to j$.

More formally, let’s define the probability $\tau_t$ ($\tau$ for
transition) as the probability of being in state $i$ at time $t$ and state
$j$ at time $t+1$, given the observation sequence and of course the model:

\[ \tau_t(i,j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) \tag{D.11} \]

In order to compute $\tau_t$, we first compute a probability which is
similar to $\tau_t$, but differs in including the probability of the
observation:

\[ \text{not-quite-}\tau_t(i,j) = P(q_t = i, q_{t+1} = j, O \mid \lambda) \tag{D.12} \]

Figure D.3 shows the various probabilities that go into computing
not-quite-$\tau_t$: the transition probability for the arc in question, the
$\alpha$ probability before the arc, the $\beta$ probability after the arc,
and the observation probability for the symbol just after the arc.

Figure D.3 Computation of the joint probability of being in state $i$ at
time $t$ and state $j$ at time $t+1$. The figure shows the various
probabilities that need to be combined to produce
$P(q_t = i, q_{t+1} = j, O \mid \lambda)$: the $\alpha$ and $\beta$
probabilities, the transition probability $a_{ij}$ and the observation
probability $b_j(o_{t+1})$. After Rabiner (1989).

These are multiplied together to produce not-quite-$\tau_t$ as follows:

\[ \text{not-quite-}\tau_t(i,j) = \alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1) \tag{D.13} \]

In order to compute $\tau_t$ from not-quite-$\tau_t$, the laws of
probability instruct us to divide by $P(O \mid \lambda)$, since:

\[ P(X \mid Y, Z) = \frac{P(X, Y \mid Z)}{P(Y \mid Z)} \tag{D.14} \]

The probability of the observation given the model is simply the
forward probability of the whole utterance (or alternatively the backward
probability of the whole utterance!), which can thus be computed in a
number of ways:

\[ P(O \mid \lambda) = \alpha_N(T) = \beta_1(1) = \sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t) \tag{D.15} \]

So, the final equation for $\tau_t$ is:

\[ \tau_t(i,j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{\alpha_N(T)} \tag{D.16} \]

The expected number of transitions from state $i$ to state $j$ is then the
sum over all $t$ of $\tau_t(i,j)$. For our estimate of $a_{ij}$ in (D.10),
we just need one more thing: the total expected number of transitions from
state $i$. We can get this by summing over all transitions out of state $i$.
Here’s the final formula for $\hat{a}_{ij}$:

\[ \hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \tau_t(i,j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} \tau_t(i,j)} \tag{D.17} \]


We also need a formula for recomputing the observation probability.
This is the probability of a given symbol $v_k$ from the observation
vocabulary $V$, given a state $j$: $\hat{b}_j(v_k)$. We will do this by
trying to compute:

\[ \hat{b}_j(v_k) = \frac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j} \tag{D.18} \]

For this we will need to know the probability of being in state $j$ at time
$t$, which we will call $\sigma_j(t)$ ($\sigma$ for state):

\[ \sigma_j(t) = P(q_t = j \mid O, \lambda) \tag{D.19} \]


Once again, we will compute this by including the observation sequence
in the probability:

\[ \sigma_j(t) = \frac{P(q_t = j, O \mid \lambda)}{P(O \mid \lambda)} \tag{D.20} \]

As Figure D.4 shows, the numerator of (D.20) is just the product of the
forward probability and the backward probability:

\[ \sigma_j(t) = \frac{\alpha_j(t)\, \beta_j(t)}{P(O \mid \lambda)} \tag{D.21} \]
We are ready to compute $\hat{b}$. For the numerator, we sum $\sigma_j(t)$
for all time steps $t$ in which the observation $o_t$ is the symbol $v_k$
that we are interested in. For the denominator, we sum $\sigma_j(t)$ over
all time steps $t$. The result will be the percentage of the times that we
were in state $j$ that we saw symbol $v_k$ (the notation
$\sum_{t=1\ \text{s.t.}\ o_t = v_k}^{T}$ means “sum over all $t$ for which
the observation at time $t$ was $v_k$”):

\[ \hat{b}_j(v_k) = \frac{\sum_{t=1\ \text{s.t.}\ o_t = v_k}^{T} \sigma_j(t)}{\sum_{t=1}^{T} \sigma_j(t)} \tag{D.22} \]

We now have ways to re-estimate the transition $a$ and observation $b$
probabilities from an observation sequence $O$ assuming that we already have
a previous estimate of $a$ and $b$. The entire training procedure for HMMs,
called embedded training, first chooses some estimate for $a$ and $b$, and
then uses equations (D.22) and (D.17) to re-estimate $a$ and $b$, and then
repeats until convergence. In the next sections we will see how
forward-backward is extended to inputs which are non-discrete (‘continuous
observation densities’) via Gaussian functions. Section 7.7 discussed how
the embedded training algorithm gets its initial estimates for $a$ and $b$.
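
One re-estimation pass can be sketched in Python as below. This is an illustration under explicit assumptions, not the authors' implementation: states are 0-indexed with state 0 as the non-emitting start and state N-1 as the non-emitting end; training uses a single observation sequence; and the treatment of transitions out of the start state and into the end state (via the state-occupancy probabilities at the first and last time steps) is a choice made here so that each row of the new transition matrix sums to one. All function and variable names are hypothetical.

```python
def forward(a, b, obs):
    N, T = len(a), len(obs)
    em = range(1, N - 1)
    alpha = [[0.0] * N for _ in range(T)]
    for j in em:
        alpha[0][j] = a[0][j] * b[j][obs[0]]
    for t in range(1, T):
        for j in em:
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in em) * b[j][obs[t]]
    return alpha, sum(alpha[T - 1][i] * a[i][N - 1] for i in em)


def backward(a, b, obs):
    N, T = len(a), len(obs)
    em = range(1, N - 1)
    beta = [[0.0] * N for _ in range(T)]
    for i in em:
        beta[T - 1][i] = a[i][N - 1]
    for t in range(T - 2, -1, -1):
        for i in em:
            beta[t][i] = sum(a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] for j in em)
    return beta


def reestimate(a, b, obs):
    """One Baum-Welch pass: return re-estimated (a, b)."""
    N, T = len(a), len(obs)
    em = range(1, N - 1)
    alpha, p_obs = forward(a, b, obs)
    beta = backward(a, b, obs)
    # sigma[t][j]: probability of being in state j at time t, given O.
    sigma = [[alpha[t][j] * beta[t][j] / p_obs for j in range(N)] for t in range(T)]
    new_a = [[0.0] * N for _ in range(N)]
    new_b = [dict() for _ in range(N)]
    for j in em:
        new_a[0][j] = sigma[0][j]  # expected start -> j transitions
    for i in em:
        # Expected number of visits to state i (assumed non-zero here).
        occupancy = sum(sigma[t][i] for t in range(T))
        for j in em:
            # Sum over t of tau_t(i, j): expected i -> j transitions.
            expected = sum(alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j]
                           for t in range(T - 1)) / p_obs
            new_a[i][j] = expected / occupancy
        new_a[i][N - 1] = sigma[T - 1][i] / occupancy  # expected i -> end
        for v in b[i]:
            new_b[i][v] = sum(sigma[t][i] for t in range(T) if obs[t] == v) / occupancy
    return new_a, new_b
```

Calling `reestimate` repeatedly, feeding each result back in, gives the iterate-until-convergence loop; the probability of the training sequence should never decrease from one pass to the next.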


Continuous  Probability  Densities 

The version of the parameter re-estimation that we have described so far in
this section assumes that the input observations are discrete symbols from
some reasonably-sized alphabet. This is naturally true for some uses of HMMs;
for example, Chapter 8 will introduce the use of HMMs for part-of-speech
tagging. Here the observations are words of English, which form a
reasonably-sized finite set of, say, approximately 100K words. For speech
recognition, the LPC cepstral features that we introduced constitute a much
larger alphabet (11 features, each one, say, a 32-bit floating-point number),
for a total vocabulary size of $2^{(11 \times 32)}$. In fact, since in
practice we usually use not just the 11 features but delta features and
double-delta features as well, the vocabulary size would be enormous.
Chapter 7 mentioned that one way to solve this problem is to cluster or
vector quantize the cepstral features into a much smaller set of discrete
observation symbols. A more effective approach is to use either mixtures of
Gaussian estimators or neural networks (multi-layer perceptrons) to estimate
a probability density function or pdf over a continuous space, as we
suggested in Chapter 7.
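
To give a feel for the size of this alphabet, the arithmetic can be sketched in a few lines of Python. The helper name `alphabet_size` is hypothetical, and the figure of 33 features (the 11 cepstral features plus their deltas and double-deltas) is an assumption for illustration; the text above does not state the exact count.

```python
def alphabet_size(num_features, bits_per_feature):
    # Each feature can take 2**bits_per_feature distinct bit patterns, so a
    # vector of features can take 2**(num_features * bits_per_feature).
    return 2 ** (num_features * bits_per_feature)

cepstral_only = alphabet_size(11, 32)
# Assumption: deltas and double-deltas triple the feature count to 33.
with_deltas = alphabet_size(33, 32)

print(len(str(cepstral_only)), "decimal digits for 11 features")
print(len(str(with_deltas)), "decimal digits for 33 features")
```

Either number is far too large to serve as a table of discrete observation symbols, which is the motivation for modeling a density instead.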

HMMs with Gaussian observation-probability estimators are trained by a
simple extension to the forward-backward algorithm. Recall from Chapter 7
that in the simplest use of Gaussians, we assume that the possible values
of the observation feature vector $o_t$ are normally distributed, and so we
represent the observation probability function $b_j(o_t)$ as a Gaussian curve
with mean vector $\mu_j$ and covariance matrix $\Sigma_j$ (prime denotes
vector transpose):


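\[
b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n\,|\Sigma_j|}} \exp\left(-\frac{1}{2}(o_t-\mu_j)'\,\Sigma_j^{-1}\,(o_t-\mu_j)\right) \tag{D.23}
\]

where $n$ is the number of features in the vector $o_t$.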

Usually we make the simplifying assumption that the covariance matrix
$\Sigma_j$ is diagonal, which means that in practice we are keeping a single
separate mean and variance for each feature in the feature vector.
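
Under the diagonal-covariance assumption the density factors into one univariate Gaussian per feature. The following Python sketch evaluates it in the log domain, which is an implementation choice of this example (to avoid underflow) rather than something the text prescribes. The function name and argument layout are hypothetical, and the sketch uses the standard normalized Gaussian form.

```python
import math

def diag_gaussian_log_density(o, mean, var):
    """Log density of a Gaussian with diagonal covariance.

    o, mean, var are equal-length lists: the observation vector, the
    per-feature means, and the per-feature variances (all assumed > 0).
    """
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        # Log of a univariate Gaussian with mean m and variance v.
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return log_p
```

Exponentiating the result gives the value of $b_j(o_t)$ for a state whose parameters are `mean` and `var`.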

How are the mean and covariance of the Gaussians estimated? It is
helpful again to consider the simpler case of a non-hidden Markov Model,
with only one state $i$. The vector of feature means $\mu$ and the vector of
covariances $\Sigma$ could then be estimated by averaging:


\[
\hat{\mu}_i = \frac{1}{T}\sum_{t=1}^{T} o_t \tag{D.24}
\]

\[
\hat{\Sigma}_i = \frac{1}{T}\sum_{t=1}^{T}\left[(o_t-\mu_i)'(o_t-\mu_i)\right] \tag{D.25}
\]

But since there are multiple hidden states, we don't know which
observation vector $o_t$ was produced by which state. What we would like to
do is assign each observation vector $o_t$ to every possible state $i$,
prorated by the probability that the HMM was in state $i$ at time $t$.
Luckily, we already know how to do this prorating; the probability of being
in state $i$ at time $t$ is $\sigma_i(t)$, which we saw how to compute above!
Of course we'll need to do the probability computation of $\sigma_i(t)$
iteratively, since getting a better observation probability b will also help
us be more sure of the probability $\sigma$ of being in a state at a certain
time. So the actual re-estimation equations are:


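\[
\hat{\mu}_i = \frac{\sum_{t=1}^{T} \sigma_i(t)\, o_t}{\sum_{t=1}^{T} \sigma_i(t)} \tag{D.26}
\]

\[
\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \sigma_i(t)\,(o_t-\mu_i)'(o_t-\mu_i)}{\sum_{t=1}^{T} \sigma_i(t)} \tag{D.27}
\]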


The sums in the denominators are for the same normalization that we
saw in (D.22). Equations (D.26) and (D.27) are then used in the
forward-backward (Baum-Welch) training of the HMM. The values of the means
$\mu_i$ and covariances $\Sigma_i$ are first set to some initial estimate,
which is then re-estimated until the numbers converge.
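
The prorated averaging described above can be sketched in Python as an occupancy-weighted mean and per-feature variance for a single state. This is a minimal sketch under stated assumptions: the covariance is taken to be diagonal, the variance is computed around the newly re-estimated mean, the occupancy values $\sigma_i(t)$ are supplied by a forward-backward pass that is not shown, and the function name is hypothetical.

```python
def reestimate_gaussian(observations, occupancy):
    """Occupancy-weighted mean and per-feature variance for one state i.

    observations[t] : feature vector o_t (list of floats)
    occupancy[t]    : sigma_i(t), probability of being in state i at time t
    """
    dim = len(observations[0])
    total = sum(occupancy)  # shared denominator of both estimates
    if total <= 0.0:
        raise ValueError("state has zero total occupancy")
    # Weighted mean: each observation counts in proportion to sigma_i(t).
    mean = [
        sum(w * o[d] for w, o in zip(occupancy, observations)) / total
        for d in range(dim)
    ]
    # Weighted variance per feature (diagonal covariance assumption).
    var = [
        sum(w * (o[d] - mean[d]) ** 2 for w, o in zip(occupancy, observations)) / total
        for d in range(dim)
    ]
    return mean, var
```

In a full training loop this function would be called once per state on every iteration, with the occupancy values recomputed from the updated parameters each time.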
See Jelinek (1997) or Rabiner and Juang (1993) for a more complete
description of the forward-backward algorithm. Jelinek (1997) also shows
the relationship between forward-backward and EM.
A* decoder
A* evaluation function, 254
Abeille, A., 455, 470